PREFACE


This edition has been completely rewritten. The main features of the new edition 
are the following: 


1. The mathematical/statistical threshold has been lowered in order to make the
material more accessible to students with only an elementary prior knowledge
of statistics. This has resulted in a somewhat larger proportion of words to
symbols in the early chapters than would otherwise have been the case. A
series of paragraphs on mathematical/statistical topics has also been provided
in Appendix A. These are keyed into the early chapters to ease the transition 
into the heart of the book. For the same reason the chapter on matrix algebra 
has been retained and, indeed, expanded to include a geometric as well as an 
algebraic treatment of some topics. 

2. All the inference procedures for the general linear model have been derived as 
special cases of a single basic procedure, namely, the testing of a set of linear 
restrictions on the parameters of the model (Chapter 5). This leads in turn to 
an exhaustive treatment of tests for structural change (Chapter 6). Chapter 6 
also contains extended treatments of the use of dummy variables and of 
multicollinearity among the regressors. 

3. Every effort has been made to cover both new and old topics on which 
substantial work has been done in recent years and which are thought to be 
significant and enduring rather than passing fancies. Such topics include the 
estimation of sets of equations with special reference to transcendental 
logarithmic approximations and applications in energy economics (Chapter 8), 
autocorrelated error terms (Chapter 8), time series techniques (Chapter 9) and, 
in A Smorgasbord of Further Topics (Chapter 10), the following “menu”: 
recursive residuals, spline functions, pooling of time-series and cross-section 
data, variable-parameter models, qualitative dependent variables, and errors 
in variables. The author has also granted himself the indulgence of some 
personal comments on the present state of econometrics (Chapter 12). 




4. The problem sets have been extended to become truly Anglo-American, with 
offerings from the Royal Statistical Society plus the universities of Cambridge, 
London, Manchester, and Oxford on one side of the pond, and Chicago, 
Michigan, Yale, and Washington on the other side. Grateful acknowledgment 
is made to various anonymous authorities for the first set, and to Arnold 
Zellner, Jan Kmenta, Peter Phillips, and Charles Nelson for the second. 
Appendix B also contains an extensive set of statistical/econometric tables,
and grateful acknowledgment is made to the appropriate sources for permission
to publish them.


My debts to many individuals can be warmly acknowledged but never fully
recompensed. Craig Riddell (University of British Columbia) read the entire
manuscript and contributed many valuable comments. I am very grateful to both
of them. My thanks also go to Ian McAvinchey (University of Aberdeen) and to
my colleagues Ken Chomitz, Max Fry, and Charles Lave (University of California,
Irvine) for a similar service. Ken Chomitz has also produced a solutions
manual for comments on various chapters. This in some places is almost a 
supplementary text in that extended solutions have been written for various 
problems, outlining particular issues that could not be dealt with in the main text. 
Copies of the solutions manual are available to instructors on application to the 
publishers. Kathy Alberti and Barbara Sawyer did a magnificent job on a difficult 
manuscript. In addition, Barbara Sawyer did the preliminary artwork, prepared 
the tables in Appendix B, and proofread the entire manuscript. Finally, I 
gratefully acknowledge the questions and suggestions from teachers and students 
in many parts of the world during the two decades since the first publication of 
this book. I can only hope that this third edition will inspire a similar response. 


J. Johnston 
THE NATURE OF ECONOMETRICS


Before asking the question, “What is econometrics?,” one must pose the prior 
question, “What is economics?” The answer to the second question will indicate 
the role that econometrics can play in the development of economics. Although 
the focus of the exposition in this chapter will be on economic models, the 
methods that have been developed in econometrics can and do play an important 
role in other social sciences, where there is a concern with building and estimating 
models of the interconnections between various sets of variables in a predominantly
nonexperimental situation.


1-1 ECONOMIC MODEL BUILDING 


Economists seek to understand the nature and functioning of economic systems. 
Their concerns may relate to global aggregates, or macro quantities, such as the 
value of the gross national product (GNP), the level of employment, or the
current level of the consumer price index. Alternatively, the focus of attention
may be some sector or area of the economy, such as production and employment
in the automobile industry or the price and volume of the peanut crop in Georgia.
One objective of such an understanding is to be able to make conditional 
predictions of the likely future development of the system and hopefully enable 
economic agents, whether government, business, or consumers, to take action to 
control to some degree the evolution of the system. Another important objective
is to test economic theories about the system.




The first step in seeking to understand the functioning of a system is to build 
a theoretical model. All models are inevitably simplifications of reality, and the 
model builder seeks to capture the fundamental features of the system being 
studied. The performance of an economy, or a sector of an economy, at any point 
in time will depend upon the decisions of various economic agents, taken in the 
context of the existing state of technology with given stocks of capital, labor, and 
other limited productive resources. Thus theoretical models typically contain 
behavioral relations, which describe the forces thought to determine the behavior 
of various groups of economic agents, and technological relations, which describe 
the restrictions imposed by the current technology and endowments of the system. 
Often technological relations, such as the production function, describing the 
maximum output achievable with various inputs of capital, labor, and other 
productive resources, may not appear explicitly in the model, but will have been 
used in the derivation of behavioral relations, such as the demand function for 
labor, and so on. In addition to behavioral and technological relations, economic 
models typically contain identities or definitional relations. 


1-2 A NATIONAL INCOME MODEL 


As an example of the model building process let us consider one of the simplest 
forms of the national income model, which is used as a pedagogic device in most 
elementary textbooks on economics. Such models begin with the national income 
identity. For a closed economy with no foreign trade, this identity in any period is 


$y = c + i + g$    (1-1)

where y = gross national product (GNP)
      c = consumption expenditure
      i = investment expenditure
      g = government expenditure


all expenditure flows being measured in real terms. The construction of the model 
proceeds with the formulation of hypotheses about the determinants of the 
expenditure components of GNP. 

Consumption expenditure might be hypothesized as dependent on disposable 
income, net of tax, and the rate of interest. Thus we write†

$c = f((1 - \tau)y, r)$    (1-2)

where $\tau$ = tax rate (assumed constant across the economy)
      r = rate of interest

† See App. A-1, Functions and Derivatives.

The theoretical expectations about this relation are

$0 < f_1 < 1 \qquad f_2 < 0$    (1-3)

where $f_i$ indicates the partial derivative of the function with respect to the ith
argument. The first assumption in Eq. (1-3) is that the marginal propensity to
consume out of disposable income is a positive fraction less than unity. The 
second assumption is that a rise in the rate of interest will have a depressing effect 
on consumption since it raises the return on savings, increases the cost of 
financing consumer durables, and also reduces the nominal value of bonds, which 
are a part of wealth, which in turn might appear as an argument of the 
consumption function but has been omitted from Eq. (1-2) on grounds of
simplicity.

The investment function may be specified as

$i = f(\Delta y, r)$    (1-4)

with

$f_1 > 0 \qquad f_2 < 0$    (1-5)


The term $\Delta y$ indicates the change in GNP. Investment is positively influenced by
profit expectations, and the crude assumption here is that observed changes in 
real GNP serve as a proxy for these profit expectations. The rate of interest is 
again expected to be negatively related to this form of expenditure. 

Collecting results, so far we have a three-equation model, namely, 


$y = c + i + g$
$c = f((1 - \tau)y, r)$
$i = f(\Delta y, r)$


supplemented by the expected signs on derivatives expressed in Eqs. (1-3) and 
(1-5). This model then constitutes a theory about the joint determination, or 
“explanation,” of the three variables c, i, and y. Such an explanation is obviously 
conditional on the values assumed for g, r, and $\tau$. The model builder now faces a
decision on how to treat these remaining variables. Should one formulate theories 
to explain the determination of government expenditure, the rate of interest, and 
the tax rate, thus expanding the system to one of six equations? If one does, the 
new equations will almost certainly contain some explanatory variables on the 
right-hand side that have not previously appeared in the system, and these, in 
turn, raise the question of how they are to be treated. It might seem that economic 
models must become infinitely large, but there is not, of course, an infinite 
number of variables to be explained. In any case the behavior of model builders is 
very pragmatic. Everything is relative: all depends on the problem at hand. For 
some purposes a small model is sufficient and some variables, which in larger 
models would have explanatory equations, may be left “unexplained.” In the
present instance we make no pretense at economic realism, but only require a
model for illustrative purposes, so we will restrict it to the three equations already
specified.

The model contains only two behavioral relations, one for consumption and 
the other for investment. Economic theory has done two things. First, it has 
specified the list of explanatory variables on the right-hand side of each equation, 
and second, it has indicated the expected signs on the partial derivatives. This is 
usually as far as theory per se can go, but it still leaves a series of important
questions unanswered.


1-3 UNANSWERED QUESTIONS 


Functional form Theoretical considerations alone cannot usually specify the 
functional form connecting the variables in a relationship. Many functional forms 
are consistent with a priori signs on derivatives. Letting 

$z = (1 - \tau)y$
denote disposable income and omitting the rate of interest variable, the following 
functional forms all give c as a monotonically increasing function of z and, with 
appropriate restrictions on parameters, could satisfy the condition that the 
marginal propensity to consume is a positive fraction: 

$c = \alpha_0 + \alpha_1 z$

$c = \alpha_0 z^{\alpha_1}$

$c = \alpha_0 - \alpha_1 z^{-1}$

These functions, however, have different qualitative implications. In the first, an 
extra $100 of income always produces the same absolute increase in consumption 
expenditure. The second and third functions both exhibit a declining marginal 
propensity to consume as income rises. However, the second function implies that 
consumption rises indefinitely with income, while the third shows consumption 
approaching a saturation or asymptotic level $\alpha_0$ as income becomes very large.
This is a typical example of the fact that the qualitative restrictions deriving from 
economic theory do not serve to delimit functional forms very closely. 
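
To make the contrast concrete, the marginal propensity to consume implied by each of the three forms can be evaluated at a few income levels. The sketch below is illustrative only: the function names and all parameter values are hypothetical, chosen so that each form yields a positive fraction over the income range shown.

```python
import numpy as np

# All parameter values below are hypothetical, chosen only for illustration.
def mpc_linear(z, a1=0.8):
    """dc/dz for c = a0 + a1*z: the same at every income level."""
    return np.full_like(z, a1)

def mpc_power(z, a0=1.0, a1=0.9):
    """dc/dz for c = a0 * z**a1 with 0 < a1 < 1: declines as z rises."""
    return a0 * a1 * z ** (a1 - 1.0)

def mpc_reciprocal(z, a1=5000.0):
    """dc/dz for c = a0 - a1/z: declines as z rises, while c approaches a0."""
    return a1 / z ** 2

z = np.array([100.0, 200.0, 400.0])  # disposable income levels
for name, mpc in [("linear", mpc_linear), ("power", mpc_power),
                  ("reciprocal", mpc_reciprocal)]:
    print(name, np.round(mpc(z), 3))
```

All three forms satisfy the same sign restriction, yet the implied response of consumption to an extra unit of income differs markedly across income levels, which is why the choice among them has to be settled empirically.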


Data definition and measurement Theory is sometimes precise and sometimes 
sloppy in the matter of definitions. In this model, for instance, should consump- 
tion be taken to mean expenditure, including actual expenditure on consumer 
durables, or should consumption of durables be treated as an implicit flow 
measured by the value of services from the existing stock of consumer durables? 
If the second definition is taken, is this consistent with the definition of the same 
variable in the national income identity? What is meant by income? Should it be 
adjusted for purely seasonal fluctuations or not? Is it to be taken as some recently 
observed level, or should it be interpreted as some kind of “permanent” or “long
run” income? There are many different rates of interest. Should we select a
“representative” rate or some combination of rates, and should this variable be 
treated the same way in both consumption and investment functions? 


Lag structure Somewhat allied with problems of data definition are problems of 
lag structure. Should investment be specified as responding to the current interest 
rate or to some set of previous interest rates in view of the inevitable time lags 
involved in making and implementing investment decisions? Again, by the nature 
of things, economic theory cannot be specific about appropriate lag structures. 
Moreover, much of economic theorizing has necessarily been about equilibrium 
positions, as, for example, the equilibrium rate of consumption corresponding to 
some level of income, which has, in theory, remained constant long enough for 
consumers to become fully adjusted to it. In practice, the world is always 
staggering from one disequilibrium position to another, so actual data reflect
adjustment processes rather than equilibrium positions. Equilibrium theory, by 
definition, says nothing about adjustment processes, and theories of adaptation 
and adjustment are still in a fairly primitive state. 


Qualitative versus quantitative implications The theoretical model does yield 
unambiguous qualitative implications, such as that a rise in the rate of interest 
will depress GNP and a rise in government expenditures will increase it. In more 
complicated models qualitative conditions on the various equations may not lead 
to unambiguous predictions about the overall behavior of the model. If our 
simple model asserted that the rate of interest had a positive effect on consump- 
tion and a negative effect on investment, the direction of the rate of interest effect 
on GNP could not then be known without quantitative knowledge of the two 
separate effects and the magnitudes of consumption and investment. In practice, 
of course, policymakers are vitally concerned with the likely magnitude and timing 
of the effects of changes in the rate of interest, tax rates, or government 
expenditure. The expected signs of partial derivatives cannot provide this kind of 
information. 


Choice between theories So far, in discussing the previous four problems, we have 
implicitly assumed that our theoretical model is “correct,” but how can we tell 
whether a theory is sufficiently correct to be used as a valid tool of analysis? 
Perhaps there are as many theories as there are theorists. There is, in practice, a 
very important and very difficult problem involved in attempting to discriminate 
between competing theories. Some theoretical models differ in degree but not in 
kind. They might be regarded as variations on a theme. For example, another 
theorist might accept the general form of our consumption and investment 
functions but wish to add wealth as an additional explanatory variable to the first 
equation and capital stock to the second. At the other end of the spectrum would 
be a theorist who rejected the Keynesian flavor of our model and advanced 
instead a supply-determined theory of output or a model in which the fundamental
driving force was the money supply.


1-4 ROLE OF ECONOMETRICS 


Econometrics tackles all five questions. Its basic task is to put empirical flesh and 
blood on theoretical structures. This involves several crucial steps. First of all, the 
theory or model must be specified in explicit functional form. The econometrician 
does not have any special insights in this area that are denied to the economic 
theorist, so one usually starts with the simplest functional forms that are consistent
with the a priori specifications. At the same time one makes an initial
specification of the lag structure. As an example we might specify the three-equation
national income model as

$c_t = \alpha_0 + \alpha_1(1 - \tau)y_t + \alpha_2 r_t$    (1-6)
$i_t = \beta_0 + \beta_1(y_{t-1} - y_{t-2}) + \beta_2 r_{t-1}$    (1-7)
$y_t = c_t + i_t + g_t$    (1-8)

with a priori expectations

$0 < \alpha_1 < 1, \quad \alpha_2 < 0, \quad \beta_1 > 0, \quad \beta_2 < 0$


The subscripts on the variables refer to time periods. The unit time period can be 
anything considered relevant by the econometrician, provided there exist ap- 
propriate data in terms of that unit. However, it is typically a quarter or a year, 
and the model is in discrete, not continuous, time. 

The second task of the econometrician is to decide on the appropriate data 
definitions and assemble the relevant data series for the variables which enter the 
model. The third task is to perform a “marriage” of theory and data by means of 
statistical methods. The “offspring” of the marriage are various sets of statistics, 
which shed crucial light on the validity of the theoretical model that has been 
specified. The most important set consists of the numerical estimates of the 
parameters of the structural form. The Greek letters of Eqs. (1-6) and (1-7) are 
now replaced by numbers. There are further statistics which enable one to assess 
the reliability or precision with which these parameters have been estimated, 
which in turn helps us to check whether the model conforms to the theoretical 
expectations about signs of derivatives. There are still further statistics and 
diagnostic tests that help one to assess the performance of the model and decide 
whether or not to proceed sequentially by modifying the specification in certain 
directions and testing out the new variant of the model against the data. 

Most of this book will be concerned with the statistical methods used by 
econometricians in estimating, testing, and evaluating economic models. Histori- 
cally, econometrics started with the corpus of methods inherited from classical 
statistics. These methods, however, were mainly developed in the context of the
experimental sciences. Special problems of statistical inference arise in economics, 
where the possibility of controlled experiments is the exception, not the rule, and 
these will be described in the chapters to follow. All that remains to be done in 
this introductory chapter is to indicate some of the possible applications of an 
econometric model, once it has been estimated. This will again be done with the 
simple model outlined above. 


1-5 STRUCTURAL AND REDUCED FORMS 


Equations (1-6) to (1-8) constitute the structural form of the model. The structural 
form may be regarded as a theoretical explanation, or hypothesis, about the 
determination of the three variables $y_t$, $c_t$, and $i_t$, conditional on the values
currently assumed by $g_t$ and $r_t$ and also on the recent history of the system as
represented by $y_{t-1}$, $y_{t-2}$, and $r_{t-1}$. This enables us to make the following
classification of the variables in the system:

Current endogenous variables:  $c_t$, $i_t$, $y_t$
Lagged endogenous variables:   $y_{t-1}$, $y_{t-2}$
Current exogenous variables:   $g_t$, $r_t$
Lagged exogenous variables:    $r_{t-1}$


The crucial distinction is between endogenous and exogenous variables. The 
former are those variables whose current values are, in theory, explained by 
the functioning of the model. The model, however, has nothing to say about the 
determination of the exogenous variables. A second important distinction is that 
between the current time period $t$ and previous periods, such as $t - 1$, $t - 2$, and
so on. When we come to study the functioning of the model in period $t$, all lagged
values, whether of endogenous or exogenous variables, are already given and
cannot now assume new values. Once values are also fed in for the current
exogenous variables $g_t$ and $r_t$, the model then delivers the values of the current
endogenous variables $c_t$, $i_t$, and $y_t$. This point may be expressed formally by
recasting Eqs. (1-6) to (1-8) in an alternative form. Substituting Eqs. (1-6) and 
(1-7) in Eq. (1-8) and rearranging gives 


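$y_t = (\alpha_0 + \beta_0)\delta + \alpha_2\delta r_t + \beta_1\delta(y_{t-1} - y_{t-2}) + \beta_2\delta r_{t-1} + \delta g_t$    (1-9)

where

$\delta = \dfrac{1}{1 - \alpha_1(1 - \tau)}$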
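
The algebra behind Eq. (1-9) is brief, and it may help to see it written out. As a sketch, substituting Eqs. (1-6) and (1-7) into the identity (1-8) and collecting the terms in $y_t$ on the left-hand side gives:

```latex
\begin{align*}
y_t &= c_t + i_t + g_t \\
    &= \alpha_0 + \alpha_1(1-\tau)\,y_t + \alpha_2 r_t
       + \beta_0 + \beta_1(y_{t-1} - y_{t-2}) + \beta_2 r_{t-1} + g_t \\
\bigl[1 - \alpha_1(1-\tau)\bigr]\,y_t
    &= (\alpha_0 + \beta_0) + \alpha_2 r_t
       + \beta_1(y_{t-1} - y_{t-2}) + \beta_2 r_{t-1} + g_t
\end{align*}
```

Dividing both sides by $1 - \alpha_1(1 - \tau)$ then yields Eq. (1-9). The divisor is positive when $0 < \alpha_1 < 1$ and the tax rate lies between zero and one; the latter condition is an assumption made here for the sketch and is not stated explicitly in the text.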


The important point about Eq. (1-9) is that only one current endogenous variable
appears in the equation, namely, $y_t$ on the left-hand side. The right-hand-side
variables are a mixture of current exogenous variables and lagged variables,
whether endogenous or exogenous. This collection of three sets of variables is
labeled the class of predetermined variables, since, from the viewpoint of the
model in period $t$, their values either are determined by the past history of the
system or are set exogenously in the current period. The investment equation
already has nothing but predetermined variables on the right-hand side, so we
repeat it here:

$i_t = \beta_0 + \beta_1(y_{t-1} - y_{t-2}) + \beta_2 r_{t-1}$    (1-10)

Finally, substituting Eq. (1-9) in the consumption function gives

$c_t = [\alpha_0 + \alpha_1(1 - \tau)(\alpha_0 + \beta_0)\delta] + [\alpha_2 + \alpha_1(1 - \tau)\alpha_2\delta] r_t$
$\qquad + \alpha_1(1 - \tau)\beta_1\delta(y_{t-1} - y_{t-2}) + \alpha_1(1 - \tau)\beta_2\delta r_{t-1} + \alpha_1(1 - \tau)\delta g_t$    (1-11)


The three Eqs. (1-9), (1-10), and (1-11) constitute the reduced form of the 
model. Each equation of the reduced form expresses a current endogenous 
variable as a function only of predetermined variables. The reduced form may be
written compactly as

$y_t = \pi_{10} + \pi_{11} g_t + \pi_{12} r_t + \pi_{13} r_{t-1} + \pi_{14} y_{t-1} + \pi_{15} y_{t-2}$    (1-12)
$c_t = \pi_{20} + \pi_{21} g_t + \pi_{22} r_t + \pi_{23} r_{t-1} + \pi_{24} y_{t-1} + \pi_{25} y_{t-2}$    (1-13)
$i_t = \pi_{30} + \pi_{33} r_{t-1} + \pi_{34} y_{t-1} + \pi_{35} y_{t-2}$    (1-14)

where the $\pi$'s are the functions of the structural parameters indicated in Eqs.
(1-9) to (1-11). Schematically, the reduced form is indicated in Fig. 1-1.

[Figure 1-1: Inputs (the predetermined variables, that is, the exogenous variables, current and lagged, together with the lagged endogenous variables) enter the model in time period t, which delivers as output the current endogenous variables.]

The reduced form also indicates that there is one-way causation in the model 
in the sense that the exogenous variables influence the current endogenous 
variables, but there is no feedback in the opposite direction: current endogenous 
variables do not influence the exogenous variables. 
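
The mapping from structural parameters to reduced-form coefficients described in Eqs. (1-9) to (1-11) can be sketched in a few lines. The function and key names below, and all numerical values, are hypothetical and serve only to illustrate the mechanics; the check at the end confirms that the reduced-form values satisfy the national income identity.

```python
def reduced_form(a0, a1, a2, b0, b1, b2, tau):
    """Map structural parameters of Eqs. (1-6)-(1-8) to reduced-form coefficients."""
    k = a1 * (1.0 - tau)              # induced consumption per unit of GNP
    delta = 1.0 / (1.0 - k)
    y = {"const": (a0 + b0) * delta, "g": delta, "r": a2 * delta,
         "r_lag": b2 * delta, "y_lag1": b1 * delta, "y_lag2": -b1 * delta}
    i = {"const": b0, "g": 0.0, "r": 0.0,
         "r_lag": b2, "y_lag1": b1, "y_lag2": -b1}
    # c_t = a0 + k*y_t + a2*r_t, with y_t replaced by its reduced form
    c = {name: k * coef for name, coef in y.items()}
    c["const"] += a0
    c["r"] += a2
    return {"y": y, "c": c, "i": i}

def evaluate(coefs, predetermined):
    """Value of one endogenous variable given the predetermined variables."""
    return coefs["const"] + sum(coefs[name] * value
                                for name, value in predetermined.items())

# Hypothetical parameter values and data, for illustration only.
rf = reduced_form(a0=10.0, a1=0.8, a2=-0.5, b0=5.0, b1=0.2, b2=-1.0, tau=0.25)
x = {"g": 20.0, "r": 4.0, "r_lag": 5.0, "y_lag1": 100.0, "y_lag2": 98.0}
y_t, c_t, i_t = (evaluate(rf[v], x) for v in ("y", "c", "i"))
print(y_t, c_t, i_t)
print(abs(y_t - (c_t + i_t + x["g"])) < 1e-9)   # identity (1-8) holds
```

Only predetermined variables appear in the dictionary `x`, which mirrors the one-way flow from inputs to outputs in the reduced form.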


1-6 MULTIPLIERS AND DYNAMIC PROPERTIES 


The $\pi$'s of the reduced-form equations are economically very important parameters.
They measure the impact in the current period on each endogenous variable
of a unit change in any predetermined variable. Consider, for example, a unit
increase in the level of $g_t$. From Eq. (1-8) of the structural form there would be a
simultaneous increase of one unit in GNP. But from the consumption function 
(1-6), increases in GNP will induce increases in consumption, which in turn, from 
Eq. (1-8), will induce further increases in GNP. The reduced-form coefficient 


$\dfrac{\partial y_t}{\partial g_t} = \pi_{11} = \dfrac{1}{1 - \alpha_1(1 - \tau)}$

shows the end result of this process in period $t$. This is the national income
multiplier of simple Keynesian theory. For example, if $\tau = 0.25$ and $\alpha_1 = 0.8$,
$\pi_{11} = 2.5$, so that a unit increase in government expenditure, with tax rates and
all other parameters unchanged, would raise national income in the same period
by 2.5 units. Similarly, an inspection of

$\pi_{10} = (\alpha_0 + \beta_0)\delta$


shows that a unit increase (upward shift) in the intercept of either the consump- 
tion or the investment function would have equal multiplier effects on GNP. All 
the $\pi$'s are multipliers, and they are termed impact multipliers, because they show
the effect in the current period of changes in predetermined variables. Estimates 
of the structural coefficients can yield estimates of the reduced-form coefficients, 
and so these impact multipliers can be evaluated. Alternatively, the reduced-form 
equations may be estimated directly. These topics will be discussed in the chapter 
on simultaneous equation estimation later in the book. 
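
The within-period feedback between GNP and consumption that produces the impact multiplier can be sketched as a sum of successive spending rounds. The function names below are hypothetical; the parameter values are those of the numerical example in the text.

```python
def impact_multiplier(alpha1, tau):
    """pi_11 = 1 / (1 - alpha1*(1 - tau)): same-period effect of g on y."""
    k = alpha1 * (1.0 - tau)
    if not 0.0 <= k < 1.0:
        raise ValueError("need 0 <= alpha1*(1 - tau) < 1")
    return 1.0 / (1.0 - k)

def multiplier_rounds(alpha1, tau, rounds):
    """Partial sums 1 + k + k**2 + ... of the induced spending rounds."""
    k = alpha1 * (1.0 - tau)
    total, term, sums = 0.0, 1.0, []
    for _ in range(rounds):
        total += term
        sums.append(total)
        term *= k          # each round induces k units of further consumption
    return sums

print(impact_multiplier(0.8, 0.25))      # approximately 2.5
print(multiplier_rounds(0.8, 0.25, 5))   # partial sums rising toward 2.5
```

The partial sums form a geometric series whose limit is the reduced-form coefficient, which is one way of reading the statement that the coefficient shows the end result of the process.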
The impact effects in period $t$ are not the end of the story. Let us write Eq.
(1-12) in first difference form,

$\Delta y_t = \pi_{11}\Delta g_t + \pi_{12}\Delta r_t + \pi_{13}\Delta r_{t-1} + \pi_{14}\Delta y_{t-1} + \pi_{15}\Delta y_{t-2}$    (1-15)

where

$\Delta y_t = y_t - y_{t-1}$

Let us suppose that g and r have been held constant sufficiently long for y to settle
down at some constant equilibrium level. This involves the implicit assumption
that equilibrium values exist and that the system is stable, and we will return to
this point below. Equilibrium thus implies

$\Delta g_t = \Delta g_{t-1} = \Delta g_{t-2} = \cdots = 0$
$\Delta r_t = \Delta r_{t-1} = \Delta r_{t-2} = \cdots = 0$
$\Delta y_t = \Delta y_{t-1} = \Delta y_{t-2} = \cdots = 0$

Now suppose that the level of government expenditure in period $t + 1$ is raised
by an amount $d$ and then held constant at the new level indefinitely, that is,

$\Delta g_{t+1} = d, \qquad \Delta g_{t+2} = \Delta g_{t+3} = \cdots = 0$

From Eq. (1-15) the impact effect on national income in period $t + 1$ is

$\Delta y_{t+1} = \pi_{11} d$

In period $t + 2$, Eq. (1-15) reads

$\Delta y_{t+2} = \pi_{14}\Delta y_{t+1} = \pi_{14}\pi_{11} d$

In period $t + 3$, the equation reads

$\Delta y_{t+3} = \pi_{14}\Delta y_{t+2} + \pi_{15}\Delta y_{t+1} = \pi_{14}^2\pi_{11} d + \pi_{15}\pi_{11} d$

Thus the one-step change in g sets off a sequence of changes in y because of the
lags in the system. There is thus a whole series of lagged multipliers, namely,

$\partial y_{t+1}/\partial g_{t+1} = \pi_{11}$    zero lag, or impact multiplier
$\partial y_{t+2}/\partial g_{t+1} = \pi_{14}\pi_{11}$    one-period lag
$\partial y_{t+3}/\partial g_{t+1} = (\pi_{14}^2 + \pi_{15})\pi_{11}$    two-period lag


The estimated reduced form can be applied sequentially to trace out the dynamic 
effects of a postulated change in any exogenous variable. The multipliers at
various lags are called interim multipliers, and if the impact and interim multi- 
pliers are summed over an infinite time horizon, assuming that the sum converges, 
we have a total multiplier giving the final effect on the equilibrium value of an 
endogenous variable of a one-step change in an exogenous variable. 
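
The sequential application of the reduced form can be sketched as a short recursion. The function name and the numerical values are hypothetical (they correspond to a marginal propensity to consume of 0.8, a tax rate of 0.25, and an assumed acceleration coefficient of 0.2). The comparison value for the total multiplier comes from setting y constant in Eq. (1-12), which gives the coefficient on g divided by one minus the sum of the two coefficients on lagged y.

```python
def lagged_multipliers(pi11, pi14, pi15, horizon):
    """Change in y at lags 0, 1, ..., horizon-1 after a unit step in g."""
    m = []
    for lag in range(horizon):
        if lag == 0:
            m.append(pi11)                       # impact multiplier
        elif lag == 1:
            m.append(pi14 * m[0])                # one-period lag
        else:
            m.append(pi14 * m[lag - 1] + pi15 * m[lag - 2])
    return m

# Hypothetical reduced-form coefficients, for illustration only.
pi11, pi14, pi15 = 2.5, 0.5, -0.5
m = lagged_multipliers(pi11, pi14, pi15, horizon=200)
total = sum(m)                                   # truncated total multiplier
print(m[:3])
print(total, pi11 / (1.0 - pi14 - pi15))
```

With these values the interim multipliers alternate in sign and die away, so the truncated sum settles close to the equilibrium value; with coefficients implying an unstable system the sum would not converge.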

Finally, we may look briefly at a third way of expressing the system, which 
casts light on the question of stability. Equation (1-12) may be rearranged as 


$y_t - \pi_{14}y_{t-1} - \pi_{15}y_{t-2} = \pi_{10} + \pi_{11}g_t + \pi_{12}r_t + \pi_{13}r_{t-1}$    (1-16)

This is a second-order nonhomogeneous difference equation in y.† It is a fluke of
this simple model that this reduced-form equation did not contain lagged values 
of any other endogenous variable, but it is always possible to derive a difference 
equation for each endogenous variable, which contains only exogenous variables 
on the right-hand side and does not contain any other endogenous variables, 
current or lagged. Equation (1-16) may be expressed more simply for present 
purposes as 

$y_t - \pi_{14}y_{t-1} - \pi_{15}y_{t-2} = f(g, r)$    (1-17)
A remarkable feature of linear dynamic models is that each endogenous variable 
in the system can be described by a difference equation of the same order and 
with identical coefficients, but differing only in the linear combination of exoge- 
nous variables appearing on the right-hand side. To demonstrate this result in the 
present model, we take the investment function 


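$i_t = \beta_0 + \beta_1(y_{t-1} - y_{t-2}) + \beta_2 r_{t-1}$

Lag it one period and multiply by $\pi_{14}$, lag it two periods and multiply by $\pi_{15}$, and
subtract both equations from the current equation. This gives

$i_t - \pi_{14}i_{t-1} - \pi_{15}i_{t-2} = \beta_1(y_{t-1} - \pi_{14}y_{t-2} - \pi_{15}y_{t-3})$
$\qquad - \beta_1(y_{t-2} - \pi_{14}y_{t-3} - \pi_{15}y_{t-4})$
$\qquad + \beta_0(1 - \pi_{14} - \pi_{15})$
$\qquad + \beta_2(r_{t-1} - \pi_{14}r_{t-2} - \pi_{15}r_{t-3})$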
The first two terms in parentheses on the right-hand side are seen from Eq. (1-17) 
to be functions only of exogenous variables. Thus 


$i_t - \pi_{14}i_{t-1} - \pi_{15}i_{t-2} = h(g, r)$    (1-18)

Finally, applying the same treatment to the national income identity gives

$c_t - \pi_{14}c_{t-1} - \pi_{15}c_{t-2} = (y_t - \pi_{14}y_{t-1} - \pi_{15}y_{t-2})$
$\qquad - (i_t - \pi_{14}i_{t-1} - \pi_{15}i_{t-2})$
$\qquad - (g_t - \pi_{14}g_{t-1} - \pi_{15}g_{t-2})$

and, using Eqs. (1-17) and (1-18), we have

$c_t - \pi_{14}c_{t-1} - \pi_{15}c_{t-2} = k(g, r)$    (1-19)

† For an introduction to difference equations, see A. C. Chiang, Fundamental Methods of
Mathematical Economics, McGraw-Hill, New York, 1984.

Thus all three endogenous variables are characterized by a second-order difference
equation with the same coefficients. The endogenous variables will therefore 
display the same dynamic behavior. The clue to this behavior comes from the 
roots of the common characteristic equation 

$\lambda^2 - \pi_{14}\lambda - \pi_{15} = 0$

The structural and reduced-form equations show

$\pi_{14} = -\pi_{15} = \dfrac{\beta_1}{1 - \alpha_1(1 - \tau)}$

Denoting this parameter by $\alpha$, the characteristic equation becomes

$\lambda^2 - \alpha\lambda + \alpha = 0$

with roots

$\lambda_1, \lambda_2 = \dfrac{\alpha \pm \sqrt{\alpha(\alpha - 4)}}{2}$

If $\alpha < 4$, the roots are complex and the economic structure is inherently cyclical.
The product of the roots is also $\alpha$, and so if $\alpha < 1$, the cycles are damped, but if
$1 < \alpha < 4$, the cycles are explosive. We see that $\alpha$ depends on

$\beta_1$, the acceleration coefficient of the investment equation
$\alpha_1$, the marginal propensity to consume
$\tau$, the tax rate


Thus, once again, empirical estimates of these parameters shed crucial light on the 
nature of the economic structure. The above model has been highly simplified for 
expository purposes, but these methods of analysis can be and are applied to large 
systems. We have attempted to illustrate the importance of econometric estima- 
tion and testing by reference only to a simplified aggregate system. Other varied 
illustrations of the power and range of econometrics will be given in the course of
the book.
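
As a closing illustration of the stability analysis, the roots of the common characteristic equation can be computed directly and the dynamics classified from them. This is a minimal sketch with hypothetical function names and parameter values; it assumes a positive value of the parameter, as implied by a positive acceleration coefficient and a marginal propensity to consume out of GNP below unity.

```python
import cmath

def characteristic_roots(alpha):
    """Roots of lambda**2 - alpha*lambda + alpha = 0."""
    disc = cmath.sqrt(alpha * (alpha - 4.0))
    return (alpha + disc) / 2.0, (alpha - disc) / 2.0

def describe(alpha):
    """Return (cyclical?, stability label) for a positive alpha."""
    l1, l2 = characteristic_roots(alpha)
    cyclical = abs(l1.imag) > 0.0            # complex roots imply cycles
    modulus = max(abs(l1), abs(l2))
    if modulus < 1.0:
        stability = "damped"
    elif modulus > 1.0:
        stability = "explosive"
    else:
        stability = "borderline"
    return cyclical, stability

def alpha_from_structure(beta1, alpha1, tau):
    """alpha = beta1 / (1 - alpha1*(1 - tau))."""
    return beta1 / (1.0 - alpha1 * (1.0 - tau))

# Hypothetical structural values, for illustration only.
for beta1 in (0.2, 0.8):
    a = alpha_from_structure(beta1, alpha1=0.8, tau=0.25)
    print(beta1, round(a, 3), describe(a))
```

Raising the acceleration coefficient from 0.2 to 0.8 with the other values fixed moves the parameter from about 0.5 to about 2, that is, from damped to explosive cycles.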


CHAPTER TWO

THE TWO-VARIABLE LINEAR MODEL


The national income model of Chap. 1 has two complications that we do not wish 
to tackle right away. First of all, it is a simultaneous equation model with three 
equations to explain the determination of three endogenous variables. Second, 
each behavioral equation contains more than two variables. We will, however, 
begin our exposition of econometric methods by concentrating upon a single 
equation with just two variables. It is not claimed that a single two-variable 
equation is an adequate model of any economic process, but starting with it has 
the double advantage that certain fundamental ideas can be introduced in the 
simplest of all settings and that the tools and concepts developed for the 
two-variable model are essential building blocks for the more complicated cases 
which are treated in the rest of the book. 


2-1 THE LINEAR SPECIFICATION 


The relevant theory is now assumed to postulate 
$Y = f(X)$    (2-1)

where Y indicates the dependent (explained) variable and X the independent
(explanatory) variable. We may have theoretical expectations about the sign of
$f'(X)$ or about the range of values in which it lies. In this chapter we will deal
only with linear specifications.

A linear specification means that Y, or some transformation of Y, can be
expressed as a linear function of X, or some transformation of X. In this sense, 


$Y = \alpha + \beta X$    (2-2)
$Y = \alpha X^{\beta}$    (2-3)
and    $Y = \exp(\alpha + \beta/X)$    (2-4)

are all linear specifications.† The first is already linear in Y and X. The second, on
taking logarithms of both sides of the equation, may be written as

$\log Y = \log\alpha + \beta\log X$    (2-5)

which is linear in log Y and log X. The third is

$\ln Y = \alpha + \beta\left(\dfrac{1}{X}\right)$    (2-6)


which is linear in the logarithm of Y and the reciprocal of X. The function 
Y=a+ BX + yX* 


is linear in Y, X, and X?, but it is not a two-variable linear function, and so its 
treatment will be postponed to Chap. 3. The function 
=at+zs— 
Y=a X-B 
however, where a, 8, and 6 are unknown parameters, cannot be reduced to a 
linear function of some transformations of Y and X, and so cannot be treated by 


the methods of this chapter. 
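The linearizing effect of the logarithmic transformation in Eq. (2-5) can be checked numerically. The following Python sketch is an illustration only: the parameter values and the X values are hypothetical, and the helper `slopes` is introduced here purely for the demonstration. It generates exact values from Eq. (2-3) and compares the slopes between successive points before and after taking logarithms.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
alpha, beta = 2.0, 0.5
X = [1.0, 2.0, 4.0, 8.0, 16.0]
Y = [alpha * x ** beta for x in X]        # exact power function, Eq. (2-3)

def slopes(us, vs):
    """Slopes of the segments joining successive points (us[i], vs[i])."""
    return [(vs[i + 1] - vs[i]) / (us[i + 1] - us[i]) for i in range(len(us) - 1)]

# In the raw data the slope changes from segment to segment (a curve).
raw_slopes = slopes(X, Y)

# After taking logs of both variables, Eq. (2-5), every segment has slope beta.
log_slopes = slopes([math.log(x) for x in X], [math.log(y) for y in Y])

print("raw slopes:", [round(s, 3) for s in raw_slopes])
print("log slopes:", [round(s, 3) for s in log_slopes])
```

A constant slope in the transformed data is exactly what an approximately linear scatter diagram would reveal when real, noisy observations are plotted.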
The first step in the econometric investigation of the relationship between Y
and X is to obtain a sample of n pairs of observations on the two variables. The
sample data are thus indicated by

$$X_i, Y_i \qquad i = 1, 2, \ldots, n$$


Next we must make a choice between specifications such as Eqs. (2-2), (2-3), and 
(2-4). At this stage the choice is made by plotting the raw data or various 
transformations of them on two-dimensional scatter diagrams to see which, if 
any, yields an approximately linear scatter. Examples of various typical shapes 
and appropriate linearizing transformations will be given in Chap. 3. Here we will 
assume that Y and X denote appropriately transformed data, and so we postulate
the linear relationship

$$Y = \alpha + \beta X$$

where α indicates the intercept made by the line on the vertical, Y, axis and β
indicates the slope of the line.

† See App. A-2, Exponential and Logarithmic Functions.

The econometrician now faces the task of using the sample data to obtain
numerical estimates of the unknown parameters α and β. If the postulated
relationship were really true, one would have no problems at all; one would need
just two sample points and a ruler to join them. Further sample points would lie
on the same straight line and would convey no additional information. However, 
exact functional relationships such as Eq. (2-2) are inadequate descriptions of 
economic behavior. Scatter diagrams do not yield points which all lie on a single 
straight line. Thus the specification of the linear relationship is expanded to 


$$Y = \alpha + \beta X + u \tag{2-7}$$

where u denotes a stochastic variable with some specified probability distribution.†
The purpose of the u term is to characterize the discrepancies that emerge
between the actual, observed values of Y and the values that would be given by an 
exact functional relationship. 

To fix these ideas let us suppose that we have data from a budget survey with 
X representing household disposable income and Y household consumption 
expenditure. Clearly, household expenditure will depend on some crucial factors
in addition to income, such as household size and composition, so let us suppose 
that we are looking at the relationship between Y and X within a subset of 
households of a given size and composition. Nonetheless it would still be 
unrealistic to expect all households with a given income $X_i$ to display exactly the
same expenditure $\alpha + \beta X_i$. First of all, even among households of the same size
and composition and with the same income, there will be variations in the precise 
ages of the parents and children, in the number of years since marriage, in 
whether the husband is a golfer, drinker, poker player, or bird-watcher, in 
whether the wife is addicted to spring hats, Paris fashions, swimming pools, or 
foreign sports cars, in whether the household income has been increasing or 
decreasing, in whether the parents are themselves the children of thrifty, cautious 
folks or carefree spendthrifts, and so forth. This list might be extended ad 
infinitum. Many factors may not even be quantifiable, and even if they are, it is 
not usually possible to obtain data on all of them. Even if it were, the number of 
variables would almost certainly exceed the feasible number of observations, so 
that no statistical means exist for estimating their influence. Moreover, many 
variables may have very slight effects so that, even with substantial quantities of 
data, the statistical estimation of their influence will be difficult and uncertain. We 
thus let the net effect of all these possible influences be represented by a single 
stochastic variable u. 

A second reason for the addition of the stochastic term is that there may be a
basic and unpredictable element of randomness in human responses. For purposes
of practical statistics the distinction between these two reasons for variability
does not matter since, for reasons of both theory and data, we hardly ever
claim to have included all distinguishable and relevant factors in any relationship,
so that the insertion of a stochastic term is required on the first count, and the
second merely adds to its variance. Finally, we note that if there were measurement
errors in Y so that the recorded values did not accurately reflect the values
given by the theory, this would also be a component of the stochastic term and
add to its variance.

† See App. A-4, Random Variables and Probability Distributions.

The variable u is often referred to as the disturbance term in the equation, or
as the equation error. We cannot predict the specific value of u that will emerge in 
any single observation, but we can make propositions about the main features of 
its probability distribution. First of all, it is clear that the u’s may take on positive 
or negative values, since the net effect of the many omitted and unmeasurable 
variables may push Y up or down from the value it would otherwise have had. 
However, there is usually no reason to expect a bias one way or the other, so the 
first assumption about u is that its average or expected value will be zero, that is,

$$E(u) = 0$$

Second, since u is the algebraic sum of many different positive and negative 
effects, we expect numerically small values of u to be much more frequent than 
very large values, so that the distribution will be unimodal around some fairly 
small value of u. If we add the assumption of symmetry, then the modal value will 
coincide with the expected value of zero. Third, we will often assume a specific 
form for the probability distribution, and an appeal to the central limit theorem 
suggests assuming a normal probability distribution for u.† Finally, we postulate
that the various values of u will be distributed independently of one another. In
terms of the budget data, that amounts to saying that if one household displays a
positive disturbance, this does not make a positive (or a negative) disturbance
more likely for neighboring or any other households. Each disturbance is conceived
as drawn independently from some normal distribution

$$N(0, \sigma_u^2)$$

The assumption that the values of u are drawn independently from a normal
distribution with zero mean and variance $\sigma_u^2$ is written compactly as

$$u \sim \text{NID}(0, \sigma_u^2)$$

where the symbol ~ means “is distributed,” and NID stands for “normally and
independently distributed.”

The specified model is illustrated graphically in Fig. 2-1. For a household
with income $X_i$ the average or expected expenditure is given by $\alpha + \beta X_i$. The
actual expenditure will be $\alpha + \beta X_i + u_i$, where $u_i$ is a random drawing from
$N(0, \sigma_u^2)$. The complete mathematical specification of the model is

$$Y_i = \alpha + \beta X_i + u_i \qquad i = 1, 2, \ldots, n \tag{2-8a}$$
$$E(u_i) = 0 \qquad \text{for all } i \tag{2-8b}$$
$$E(u_i u_j) = \begin{cases} 0 & i \neq j, \text{ all } i, j \\ \sigma_u^2 & i = j, \text{ all } i, j \end{cases} \tag{2-8c}$$
$$p(u_i) = N(0, \sigma_u^2) \qquad \text{for all } i \tag{2-8d}$$

Figure 2-1

† See App. A-5, Normal Probability Distribution.

Assumptions (2-8b), (2-8c), and (2-8d) are a more extensive way of stating

$$u \sim \text{NID}(0, \sigma_u^2)$$

The reason for splitting them is that some of our subsequent derivations only
require assumptions (2-8b) and (2-8c) and not the assumption of normality. The
first part of assumption (2-8c) states that all possible covariances of the u’s are
zero, and the second part states that the variances of the u distributions at each
point in Fig. 2-1 are the same.

The three unknown parameters of the model are α, β, and $\sigma_u^2$. We now turn
to methods by which these parameters may be estimated.
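To make the specification in Eqs. (2-8a) to (2-8d) concrete, the following Python sketch draws one sample from the model. It is an illustration only: the parameter values, the X values, the seed, and the function name `draw_sample` are all hypothetical choices, not part of the text.

```python
import random

def draw_sample(xs, alpha, beta, sigma_u, rng):
    """Generate Y_i = alpha + beta*X_i + u_i with u_i drawn independently
    from a normal distribution with mean 0 and standard deviation sigma_u."""
    return [alpha + beta * x + rng.gauss(0.0, sigma_u) for x in xs]

# Hypothetical values, chosen only for illustration.
X = [2, 3, 1, 5, 9]
alpha, beta, sigma_u = 1.0, 1.75, 0.7
rng = random.Random(12345)            # fixed seed so the run is repeatable

Y = draw_sample(X, alpha, beta, sigma_u, rng)
expected_Y = [alpha + beta * x for x in X]               # E(Y_i) for each X_i
disturbances = [y - ey for y, ey in zip(Y, expected_Y)]  # the realized u_i

for x, y, u in zip(X, Y, disturbances):
    print(f"X = {x}:  Y = {y:.3f}  (u = {u:+.3f})")
```

In practice the disturbances cannot be recovered in this way, because α and β are unknown; the sketch can do so only because it sets the true parameter values itself.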


2-2 LEAST-SQUARES ESTIMATORS 


An estimator is defined as a formula or method of estimating some unknown 
parameter, and an estimate as the numerical value resulting from the application 
of the formula to a specific set of sample data. We start with the n sample 
observations and plot them on a scatter diagram as in Fig. 2-2. 
We typically plot scatter diagrams in the positive quadrant, but this is purely 
a matter of convenience since many economic variables, such as inventory 
investment, the balance of trade, the real rate of interest, and so forth, can take 
on both positive and negative values. Any straight line drawn through this scatter 
of points may be regarded as an estimate of the hypothesized relationship 
$Y = \alpha + \beta X + u$. A straight line is indicated by

$$\hat{Y} = a + bX \tag{2-9}$$

where $\hat{Y}$ indicates the height of the line at any given value of X. Once the
numerical values a and b have been set, the line is determined, and one such line 
is shown in Fig. 2-2. If the line has been drawn through the scatter, some data
points will lie above the line and some below. We define the residuals from the
line by

$$e_i = Y_i - \hat{Y}_i = Y_i - a - bX_i \qquad i = 1, 2, \ldots, n \tag{2-10}$$

so that any line generates a set of n sample residuals.

Figure 2-2 [fitted line $\hat{Y} = a + bX$]

† If two variables are independently distributed, their covariance will be zero, but the reverse does
not necessarily hold, except for normally distributed variables. See App. A-5, Normal Probability
Distribution.


It would now seem sensible to choose the line, that is, to choose the values of 
a and b, to make the residuals “small.” A possible criterion might be 


$$\text{Select } a, b \text{ to make } \sum_{i=1}^{n} e_i = 0$$

Using Eq. (2-10), this criterion gives†

$$\sum e_i = \sum (Y_i - a - bX_i) = 0$$

which, on dividing through by n, gives‡

$$\bar{Y} = a + b\bar{X} \tag{2-11}$$


Equation (2-11) merely gives the condition that a and b should be chosen to make 
the line go through the point of means $(\bar{X}, \bar{Y})$. Thus we could pass a line with any
slope whatsoever through $(\bar{X}, \bar{Y})$, and it would make the algebraic sum of the
residuals zero. The criterion is thus inadequate to determine a specific line. 

The least-squares criterion is stronger. If each residual is squared, negative 
signs disappear, and the sum of squared residuals is a nonnegative quantity. The
least-squares principle is

$$\text{Select } a, b \text{ to minimize } \sum e_i^2$$

† Where there is no ambiguity about the range of summation, we will use $\sum e_i$, or sometimes just
$\sum e$, instead of the more cumbersome $\sum_{i=1}^{n} e_i$, and similarly for other expressions.
‡ See App. A-3, Operations with Summation Signs.

From Eq. (2-10),

$$\sum e_i^2 = \sum (Y_i - a - bX_i)^2 \tag{2-12}$$

Thus

$$\sum e^2 = f(a, b)$$


since the sample data are given, so that passing different lines through the scatter 
(that is, choosing different values for a and b) produces variation in the residual 
sum of squares. The necessary conditions for a stationary value are 


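$$\frac{\partial(\sum e^2)}{\partial a} = \frac{\partial(\sum e^2)}{\partial b} = 0$$

Applying these conditions to Eq. (2-12) gives†

$$\begin{aligned} \sum Y &= na + b\sum X \\ \sum XY &= a\sum X + b\sum X^2 \end{aligned} \tag{2-13}$$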


These are termed the normal equations for the straight line, for reasons that will 
become clear when we discuss the geometry of least squares in Chap. 4. 

To estimate the line implied by Eq. (2-13) we first compute five quantities
from the sample data, namely,

$$n, \quad \sum X, \quad \sum Y, \quad \sum XY, \quad \text{and} \quad \sum X^2$$


Substitution of the resultant numbers in Eq. (2-13) gives two simultaneous
equations, which can then be solved for the two unknowns a and b. This gives the
least-squares regression of Y on X, namely,

$$\hat{Y} = a + bX$$

or

$$Y = \hat{Y} + e = a + bX + e$$

† $\partial(\sum e^2)/\partial a = -2\sum (Y - a - bX) = -2\sum e = 0$
gives $\sum Y = na + b\sum X$, and
$\partial(\sum e^2)/\partial b = -2\sum X(Y - a - bX) = -2\sum Xe = 0$
gives $\sum XY = a\sum X + b\sum X^2$. In obtaining the derivatives we leave the summation sign where it is,
differentiate the typical term $(Y_i - a - bX_i)^2$ with respect to a and b in turn, and simply observe the
rule that any constant can be moved in front of the summation sign but anything which varies from
one sample point to another, such as $X_i$ and $Y_i$, must be kept to the right of the summation sign.
Finally, we have dropped the subscripts to leave the equations uncluttered, since there is no ambiguity
about the range of summation. Strictly speaking, one should distinguish between the a and b which
appear in the expression for the residual sum of squares

$$\sum e^2 = \sum (Y - a - bX)^2$$

and the solution values for a and b obtained by solving the equations

$$\frac{\partial(\sum e^2)}{\partial a} = \frac{\partial(\sum e^2)}{\partial b} = 0$$

but again no ambiguity is involved, and we have kept the expressions as simple as possible.

The following example illustrates the application of these techniques to the X, Y
data in Table 2-1.

Table 2-1

          X      Y     XY     X²       Ŷ    e = Y - Ŷ
          2      4      8      4    4.50      -0.50
          3      7     21      9    6.25       0.75
          1      3      3      1    2.75       0.25
          5      9     45     25    9.75      -0.75
          9     17    153     81   16.75       0.25
Sums     20     40    230    120   40.00       0


Example 2-1 The normal equations are

$$\begin{aligned} 40 &= 5a + 20b \\ 230 &= 20a + 120b \end{aligned}$$

with solution

$$a = 1, \qquad b = 1.75$$

The regression of Y on X is

$$\hat{Y} = 1 + 1.75X$$

Substituting each sample value of X in the regression equation gives the $\hat{Y}$
and e values shown in the last two columns of Table 2-1. Note that the $\hat{Y}$
values sum to the same total as the sample Y values, and the residuals, of
course, sum to zero.
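The arithmetic of Example 2-1 can be reproduced in a few lines. The following Python sketch is an illustration only; it forms the five sample quantities and solves the two normal equations of Eq. (2-13) by Cramer's rule, which is one of several equivalent ways to solve a 2 x 2 system and is not prescribed by the text.

```python
# Sample data from Table 2-1.
X = [2, 3, 1, 5, 9]
Y = [4, 7, 3, 9, 17]

# The five quantities needed by the normal equations.
n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

# Normal equations (2-13):
#   sum_y  = n*a     + b*sum_x
#   sum_xy = a*sum_x + b*sum_x2
# Solved here by Cramer's rule.
det = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

fitted = [a + b * x for x in X]
residuals = [y - f for y, f in zip(Y, fitted)]

print(f"a = {a}, b = {b}")
print("fitted values:", fitted)
print("residuals:", residuals)
print("sum of residuals:", sum(residuals))
```

The fitted values and residuals printed by the sketch correspond to the last two columns of Table 2-1.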
The linear regression has a number of important properties. 


1. The regression line passes through the point of means $(\bar{X}, \bar{Y})$ (i.e., the sum of the
residuals is zero).

This follows directly from the first equation in Eqs. (2-13), which, on division
by n, gives

$$\bar{Y} = a + b\bar{X}$$

and it is also shown in the footnote to Eqs. (2-13).


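2. The residuals have zero covariance with the sample X values and also with the
predicted Y values.

This also follows from the footnote to Eqs. (2-13), where $\partial(\sum e^2)/\partial b = 0$ gives
$\sum Xe = 0$. The sample covariance between X and e, by definition, is

$$\begin{aligned}
\operatorname{cov}(X, e) &= \frac{1}{n}\sum (X - \bar{X})(e - \bar{e}) \\
&= \frac{1}{n}\sum (X - \bar{X})e && \text{since } \bar{e} = 0 \\
&= \frac{1}{n}\sum Xe - \frac{\bar{X}}{n}\sum e \\
&= \frac{1}{n}\sum Xe && \text{since } \sum e = 0 \\
&= 0 && \text{since } \sum Xe = 0
\end{aligned}$$

Since $\hat{Y}$ is a linear function of X it follows directly that $\operatorname{cov}(\hat{Y}, e) = 0$.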


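3. The regression coefficients may be computed sequentially from

$$b = \frac{\sum xy}{\sum x^2} \tag{2-14a}$$

and

$$a = \bar{Y} - b\bar{X} \tag{2-14b}$$

where x and y denote deviations from sample means,

$$x = X - \bar{X}, \qquad y = Y - \bar{Y}$$

Equation (2-14b) is merely a rearrangement of the first equation in Eqs.
(2-13), and Eq. (2-14a) follows from substituting Eq. (2-14b) into the second
equation of Eqs. (2-13) to get

$$\sum XY = (\bar{Y} - b\bar{X})\sum X + b\sum X^2$$

giving

$$b\left[\sum X^2 - \frac{1}{n}\left(\sum X\right)^2\right] = \sum XY - \frac{1}{n}\left(\sum X\right)\left(\sum Y\right)$$

or

$$b\sum x^2 = \sum xy$$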

Alternatively, since the least-squares line passes through the point of means,
we may take $(\bar{X}, \bar{Y})$ as a new origin, as in Fig. 2-3. Consider a point P with
coordinates $(X_i, Y_i)$. The first coordinate can also be expressed as $x_i$, its distance
from $\bar{X}$. The second coordinate can similarly be expressed as $y_i$ and split into two
components, namely,

$$y_i = \hat{y}_i + e_i$$

where

$$\hat{y}_i = \hat{Y}_i - \bar{Y}$$

which is also a proper deviation since $\hat{Y}$ and Y have the same mean value. The
regression equation can now be written

$$\hat{y} = bx$$

and the sum of squared residuals expressed as

$$\begin{aligned} \sum e^2 &= \sum (y - \hat{y})^2 \\ &= \sum (y - bx)^2 \\ &= \sum y^2 - 2b\sum xy + b^2\sum x^2 \end{aligned}$$

which is only a function of b. Setting the derivative equal to zero gives

$$b = \frac{\sum xy}{\sum x^2}$$

The sum of squared residuals is seen to be a quadratic in b. The coefficient of $b^2$ is
$\sum x^2$, which is necessarily positive (unless all X values were identical). The
quadratic is thus U-shaped, and the stationary point must give the minimum sum
of squares.

Figure 2-3


4. Decomposition of sum of squares 


The total variation in Y may be expressed as the sum of just two components,
the variation “explained” by the linear regression and the variation “unexplained”
by the regression. From property 3 we have

$$y_i = \hat{y}_i + e_i = bx_i + e_i$$

Squaring and summing over all n observations gives

$$\sum y^2 = \sum \hat{y}^2 + \sum e^2 + 2\sum \hat{y}e = b^2\sum x^2 + \sum e^2 + 2b\sum xe$$

or

$$\sum y^2 = \sum \hat{y}^2 + \sum e^2 = b^2\sum x^2 + \sum e^2 \tag{2-15}$$


since $\hat{y}$ and x each have zero covariance with e. The crucial quantities in Eq.
(2-15) are

$\sum y^2$ = total sum of squares in the dependent variable, measured about its mean (TSS)

$\sum e^2$ = residual or unexplained sum of squares (RSS)

$\sum \hat{y}^2$ = explained sum of squares (ESS)

It follows from Eq. (2-15) that the explained sum of squares may be
expressed in several alternative ways,

$$\text{ESS} = \sum \hat{y}^2 = b^2\sum x^2 = b\sum xy = \frac{(\sum xy)^2}{\sum x^2}$$

using Eq. (2-14a).


Example 2-2 The data of Table 2-1 may be expressed in deviation form, as in
Table 2-2. Thus

$$b = \frac{\sum xy}{\sum x^2} = \frac{70}{40} = 1.75$$

and

$$a = \bar{Y} - b\bar{X} = 8 - 1.75(4) = 1$$

as before. The explained sum of squares may be calculated as

$$\text{ESS} = b\sum xy = 1.75(70) = 122.5$$

and the residual sum of squares may be obtained by subtraction as

$$\text{RSS} = \text{TSS} - \text{ESS} = 124 - 122.5 = 1.5$$

The proportion of the Y variation explained by the linear regression is

$$\frac{\text{ESS}}{\text{TSS}} = \frac{122.5}{124} = 0.9879$$

Table 2-2

          x      y     xy     x²     y²
         -2     -4      8      4     16
         -1     -1      1      1      1
         -3     -5     15      9     25
          1      1      1      1      1
          5      9     45     25     81
Sums      0      0     70     40    124

The u values underlying the sample data are unknown and unobservable, for
we could only measure them if we actually knew the true values α and β. Thus the
variance of the disturbance distribution $\sigma_u^2$ cannot be estimated from a sample of
u values. We do, however, have the regression residuals $e_1, e_2, \ldots, e_n$, and it is
plausible to base an estimate of the disturbance variance on them. Two alternative
estimators are

$$\frac{\sum e^2}{n} \qquad \text{or} \qquad \frac{\sum e^2}{n - 2}$$

Both are in use, but for reasons to be explained later in this chapter we typically
use

$$s^2 = \frac{\sum e^2}{n - 2} \tag{2-16}$$

2-3 THE CORRELATION COEFFICIENT 


The regression estimated from Eq. (2-13) fixes a line which passes through the 
sample scatter of points in the X, Y space. The correlation coefficient indicates the 
“closeness” of the scatter about the fitted regression line. A visual inspection of 
the scatter cannot indicate the degree of closeness, since changes in the units of 
measurement for X and Y can stretch or contract scatters to give very different 
impressions of the relationship. 

The correlation coefficient is defined as 


$$r = \frac{\sum xy}{n s_x s_y} \tag{2-17}$$

where x and y denote deviations from sample means and $s_x$, $s_y$ are the sample
standard deviations,

$$s_x = \sqrt{\frac{\sum x^2}{n}} \qquad s_y = \sqrt{\frac{\sum y^2}{n}}$$

This is known as the Pearsonian (after the distinguished statistician Karl Pearson),
or product moment, coefficient of correlation. Its rationale may be explained as
follows. Referring to Fig. 2-3, the perpendiculars erected at $\bar{X}$ and $\bar{Y}$ divide the
diagram into four quadrants. We pay particular attention to the sign of the
product $x_i y_i$ in each quadrant.


NE quadrant xy positive 
NW quadrant xy negative 
SW quadrant xy positive 
SE quadrant xy negative 


Thus if we have a positive relationship, with sample points lying mainly in the NE
and SW quadrants, $\sum xy$ tends to be a positive number. Conversely, a negative
relationship will generate points mainly in the NW and SE quadrants, with $\sum xy$
tending to be a negative number. If there is little, if any, relationship between the
two variables, sample points will be scattered in all four quadrants, and $\sum xy$ will
tend to zero. $\sum xy$, however, has two defects as a measure of association between X
and Y. The first is that its numerical value may be increased by simply adding
further observations. This is corrected by dividing by the sample size to give the
sample covariance

$$\operatorname{cov}(X, Y) = \frac{\sum xy}{n}$$

Second, the covariance depends on the units in which X and Y are measured.
Shifting from dollars to cents for each variable would increase the covariance by a
factor of ten thousand. The covariance is standardized by dividing each deviation
by the sample standard deviation of the variable in question. Defining

$$\text{variance of } X = \operatorname{var}(X) = s_x^2 = \frac{\sum x^2}{n}$$

and so on, and by some algebraic manipulation, we have a variety of ways of
looking at and computing the correlation coefficient:

$$r = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)}\,\sqrt{\operatorname{var}(Y)}} \tag{2-18a}$$

$$r = \frac{1}{n}\sum \left(\frac{x_i}{s_x}\right)\left(\frac{y_i}{s_y}\right) \tag{2-18b}$$

$$r = \frac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}} \tag{2-18c}$$

$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} \tag{2-18d}$$


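Looking at Eq. (2-18c) and rearranging gives

$$r = b\,\frac{s_x}{s_y}$$

or

$$b = r\,\frac{s_y}{s_x} \tag{2-19}$$

which shows the relationship between the regression slope and the correlation
coefficient. Squaring Eq. (2-18c) gives

$$\begin{aligned} r^2 &= \frac{(\sum xy)^2}{(\sum x^2)(\sum y^2)} \\ &= \frac{b\sum xy}{\sum y^2} \\ &= \frac{\text{ESS}}{\text{TSS}} \\ &= 1 - \frac{\text{RSS}}{\text{TSS}} \\ &= 1 - \frac{\sum e^2/n}{\sum y^2/n} \end{aligned} \tag{2-20}$$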

Thus $r^2$ measures the proportion of the total sum of squares explained by the
regression. The last two expressions in Eq. (2-20) show that the limits of r are ±1.
The residual sum of squares is nonnegative. It is only equal to zero if each and
every residual $e_i$ is zero, that is, if all the scatter points lie exactly on a straight
line. A value of unity for $r^2$ thus corresponds to all points lying on the regression
line. The sign of r depends upon the sign of the regression slope, that is, on the
sign of the covariance term. The relationships in Eq. (2-20) also indicate why the
correlation coefficient may be taken as a measure of the degree to which the
scatter points lie close to the regression line. Note, finally, that r is a measure of
the linear relation between X and Y; it is an inappropriate and misleading statistic
if the relationship is nonlinear. Suppose, for example, that X indicates a firm’s
rate of output and Y the average variable cost per unit of output. Traditional
theory postulates a U-shaped curve, which would generate points in all four
quadrants of Fig. 2-3, with an $r^2$ tending toward zero.
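Two of the equivalent formulas for r, and the warning about nonlinear relationships, can be illustrated with a short Python sketch. It is an illustration only: the function names are introduced here, and the U-shaped data set is hypothetical, constructed so that Y is an exact function of X while the linear correlation is zero.

```python
import math

def correlation(X, Y):
    """Correlation from deviations, Eq. (2-18c)."""
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
    sum_x2 = sum((xi - x_bar) ** 2 for xi in X)
    sum_y2 = sum((yi - y_bar) ** 2 for yi in Y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

def correlation_raw(X, Y):
    """The same coefficient from raw sums, Eq. (2-18d)."""
    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxy = sum(xi * yi for xi, yi in zip(X, Y))
    sx2 = sum(xi * xi for xi in X)
    sy2 = sum(yi * yi for yi in Y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sx2 - sx ** 2) * math.sqrt(n * sy2 - sy ** 2)
    )

# Data of Table 2-1: a nearly linear scatter.
X = [2, 3, 1, 5, 9]
Y = [4, 7, 3, 9, 17]
r_table = correlation(X, Y)
r_table_raw = correlation_raw(X, Y)

# Hypothetical U-shaped data: Y is an exact function of X, yet r is zero.
X_u = [-2, -1, 0, 1, 2]
Y_u = [xi ** 2 for xi in X_u]
r_u = correlation(X_u, Y_u)

print(f"Table 2-1: r = {r_table:.4f}, r^2 = {r_table ** 2:.4f}")
print(f"U-shaped data: r = {r_u:.4f}")
```

For the Table 2-1 data the squared coefficient agrees with the ratio ESS/TSS found in Example 2-2.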


2-4 PROPERTIES OF THE LEAST-SQUARES ESTIMATORS 


Least squares is just one possible method of estimation. Other estimators may 
easily be defined. For instance, we might order the sample data by increasing size 
of X and pass a line through the first and last points. Or one might average the 
lowest two points and the highest two points and pass a line through these 
averages. Applying the first principle to the data of Table 2-1, we have 


                    X      Y
Lowest point        1      3
Highest point       9     17

giving an estimated slope of 14/8 = 1.75, which happens to coincide with the
least-squares slope. To make the line pass through the lowest point we have

$$3 = a + 1.75 \Rightarrow a = 1.25$$

This also ensures that the line passes through the upper point. The second
principle gives

                                    X       Y
Average of two lowest points       1.5     3.5
Average of two highest points      7.0    13.0

giving an estimated slope of 9.5/5.5 = 1.727 and an intercept of

$$a = 13 - 7(1.727) = 0.911$$

Recapitulating the least-squares regression, we now have three possible equations,
namely,

$$\hat{Y} = 1 + 1.75X$$

$$\hat{Y} = 1.25 + 1.75X$$

$$\hat{Y} = 0.911 + 1.727X$$

In this example the scatter is almost perfectly linear, and so there is little
difference between the equations yielded by the three methods. The more dispersed
the scatter, the greater are likely to be the differences between the results.
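The two alternative estimating principles can be sketched in Python as follows. This is an illustration only; the helper names `line_through` and `mean_point` are introduced here and are not part of the text.

```python
# Sample data from Table 2-1, sorted by increasing X.
X = [2, 3, 1, 5, 9]
Y = [4, 7, 3, 9, 17]
points = sorted(zip(X, Y))

def line_through(p, q):
    """Return (intercept, slope) of the straight line through points p and q."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    return p[1] - slope * p[0], slope

def mean_point(pts):
    """Average X and average Y of a group of points."""
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

# First principle: join the points with the lowest and highest X.
endpoint_line = line_through(points[0], points[-1])

# Second principle: join the average of the two lowest points
# to the average of the two highest points.
averaged_line = line_through(mean_point(points[:2]), mean_point(points[-2:]))

print("endpoint line (intercept, slope):", endpoint_line)
print("averaged line (intercept, slope):", averaged_line)
```

Without intermediate rounding the averaged line has an intercept of about 0.909; the value 0.911 quoted above results from using the slope rounded to 1.727.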

The crucial question now is how to choose between equations or, equivalently,
how to choose between estimating principles. The answer given in classical
statistics is to choose on the basis of certain important properties of the various
estimators. These properties refer to the behavior of the estimators in repeated
sampling. At first sight this may seem a strange criterion. We usually have just one 
set of sample data, and we wish to do the best we can with that. Bayesian 
inference techniques, which are briefly discussed in Chap. 12, focus directly on 
that concern, but in this chapter we will outline the classical approach. 

To fix ideas, let us return to the data of Table 2-1. We assume these data to
have been generated by the model $Y = \alpha + \beta X + u$, and we now perform the
conceptual experiment of imagining that repeated samples of five observations are
drawn, but with the X values fixed from sample to sample. This is not to imply that
economic variables are subject to experimental control and can actually be held
constant as further samples are drawn. It is, rather, an assumption that substantially
simplifies the derivation of the properties of the estimators, and it can be
relaxed at a later stage. With the X’s fixed, the only source of variation from
sample to sample is in the u’s, which in turn is reflected in the Y’s. Suppose that,
say, 10,000 samples were drawn, each consisting of five pairs of X, Y values. The
application of the least-squares principle would thus yield 10,000 pairs of a, b
values. These could be arranged in a bivariate frequency distribution. As the
number of samples increases indefinitely, this distribution would tend to some
smooth continuous function

$$f(a, b)$$


This is defined as the joint sampling distribution of a and b. 

As with any bivariate distribution, we can integrate to obtain the marginal
distributions f(a) and f(b), which are the sampling distributions of a and b,
respectively. Concentrating on f(b), we can imagine some distribution such as
that shown in Fig. 2-4. This is a picture of the various sample values of b that
would be obtained by the repeated application of the least-squares method to
successive samples of n observations. The true parameter, β, that is being
estimated is unknown, but it is reasonable to expect some sample estimates to be
above β and others to be below.

There are three features of a sampling distribution that are crucial to the 
assessment of an estimator. These are the mean, the variance, and the mean- 
squared error. The mean value 


$$E(b) = \int b f(b)\, db$$

indicates the average value that would be yielded by the estimator in repeated
applications. The bias of the estimator is defined as

$$\text{bias}(b) = E(b) - \beta \tag{2-21}$$

If the bias is zero, the estimator is said to be unbiased. If the bias is nonzero, the
estimator is said to be biased. The variance of the distribution

$$\operatorname{var}(b) = \sigma_b^2 = \int [b - E(b)]^2 f(b)\, db \tag{2-22}$$

measures the spread of the distribution about its mean value. The standard
deviation of the distribution $\sigma_b$ is often referred to as the standard error of b. The
smaller the variance of the sampling distribution, the greater is the precision of the
estimator, that is, the greater is the chance of a sample estimate lying within some
specified interval about the true value. If we are comparing two estimators which
are both unbiased but have different variances, one would naturally prefer the
estimator with the smaller variance. If we consider the class of all unbiased
estimators and can find one with a smaller variance than any other, it is said to be
a best unbiased estimator.

Figure 2-4 [sampling distribution f(b)]

A more difficult choice problem arises in comparing two estimators if both 
are biased and also have different variances. If one estimator has a larger bias but 
a smaller sampling variance than the other it is intuitively plausible to consider a 
tradeoff between the two characteristics. This notion is given formal expression in 
the mean-squared error, 


$$\text{MSE}(b) = E\{(b - \beta)^2\} = \int (b - \beta)^2 f(b)\, db$$

Using Eqs. (2-21) and (2-22),

$$\begin{aligned} \text{MSE}(b) &= E\{[[b - E(b)] + [E(b) - \beta]]^2\} \\ &= E\{[b - E(b)]^2\} + E\{[E(b) - \beta]^2\} + 2E\{[b - E(b)][E(b) - \beta]\} \\ &= \operatorname{var}(b) + [\text{bias}(b)]^2 \end{aligned} \tag{2-23}$$

as the cross-product term vanishes.† The mean-squared error measures the spread
of the estimates around the true value β. Equation (2-23) shows that on the
mean-squared-error criterion a biased estimator may be preferred to one with a
smaller or zero bias if its variance is sufficiently small to offset the larger bias.
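The decomposition in Eq. (2-23) holds for any finite collection of estimates when averages replace expectations, which gives a simple numerical check. The following Python sketch is an illustration only: the two lists of estimates are hypothetical numbers invented for the demonstration, and the function names are introduced here.

```python
def mean(values):
    return sum(values) / len(values)

def mse_decomposition(estimates, true_value):
    """Return (mse, variance, bias), with averages in place of expectations."""
    centre = mean(estimates)
    variance = mean([(e - centre) ** 2 for e in estimates])
    bias = centre - true_value
    mse = mean([(e - true_value) ** 2 for e in estimates])
    return mse, variance, bias

# Hypothetical estimates of a parameter whose true value is 2.0.
beta_true = 2.0
unbiased_but_noisy = [1.0, 3.0, 1.5, 2.5, 2.0]     # centred on 2.0, widely spread
biased_but_tight = [2.2, 2.3, 2.25, 2.2, 2.3]      # centred on 2.25, tightly packed

mse_a, var_a, bias_a = mse_decomposition(unbiased_but_noisy, beta_true)
mse_b, var_b, bias_b = mse_decomposition(biased_but_tight, beta_true)

print(f"unbiased: MSE = {mse_a:.4f}, var + bias^2 = {var_a + bias_a ** 2:.4f}")
print(f"biased:   MSE = {mse_b:.4f}, var + bias^2 = {var_b + bias_b ** 2:.4f}")
```

In this constructed case the biased collection has the smaller mean-squared error, which is the tradeoff described above.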
With this introduction we can return to the least-squares estimators and use
assumptions (2-8a), (2-8b), and (2-8c) to derive the means and variances of their
sampling distributions. Looking first at the least-squares slope, b may be expressed
as

$$b = \frac{\sum x_i Y_i}{\sum x_i^2} = \sum w_i Y_i \tag{2-24}$$

where

$$w_i = \frac{x_i}{\sum x_i^2} \tag{2-25}$$

It follows from Eq. (2-25) that

$$\sum w_i = 0, \qquad \sum w_i X_i = 1, \qquad \text{and} \qquad \sum w_i^2 = \frac{1}{\sum x_i^2} \tag{2-26}$$

† In evaluating $E\{[b - E(b)][E(b) - \beta]\}$ the rules are the same as for summations. Any factor
which is a constant may be moved to the left and put as a multiplier in front of the expectation sign.
The expectation of a constant is that same constant. Thus

$$\begin{aligned} E\{[b - E(b)][E(b) - \beta]\} &= [E(b) - \beta] \cdot E\{b - E(b)\} \\ &= [E(b) - \beta][E(b) - E(b)] \\ &= 0 \end{aligned}$$
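To establish the properties of b in repeated sampling we need to express b in
terms of the underlying stochastic variable u. Combining Eqs. (2-8a) and (2-24)
gives

$$\begin{aligned} b &= \sum w_i(\alpha + \beta X_i + u_i) \\ &= \beta + \sum w_i u_i \qquad \text{using Eq. (2-26)} \end{aligned}$$

Taking expectations,

$$E(b) = \beta \tag{2-27}$$

since $E(u_i) = 0$ for all i. Thus the least-squares slope is an unbiased estimator of
the true slope. Its variance may be expressed as

$$\operatorname{var}(b) = E\{(b - \beta)^2\} = E\left\{\left(\sum w_i u_i\right)^2\right\}$$

But

$$\left(\sum w_i u_i\right)^2 = \sum w_i^2 u_i^2 + 2\sum_{i<j} w_i w_j u_i u_j$$

Taking expectations and using Eq. (2-8c),

$$\operatorname{var}(b) = \sigma_u^2 \sum w_i^2 \tag{2-28}$$

which, on using Eq. (2-26), gives

$$\operatorname{var}(b) = \frac{\sigma_u^2}{\sum x^2} \tag{2-29}$$

Turning to the intercept,

$$\begin{aligned} a &= \bar{Y} - b\bar{X} \\ &= \alpha + \beta\bar{X} + \bar{u} - b\bar{X} \\ &= \alpha - (b - \beta)\bar{X} + \bar{u} \end{aligned}$$

Since $E(b) = \beta$ and $E(\bar{u}) = 0$, it then follows that

$$E(a) = \alpha \tag{2-30}$$

The variance of a is

$$\begin{aligned} \operatorname{var}(a) &= E\{(a - \alpha)^2\} \\ &= \bar{X}^2 E\{(b - \beta)^2\} + E(\bar{u}^2) - 2\bar{X}E\{(b - \beta)\bar{u}\} \\ &= \sigma_u^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x^2}\right] \end{aligned} \tag{2-31}$$

In this derivation $E\{\bar{u}^2\} = \sigma_u^2/n$ since $\bar{u}$ is the mean of a random sample of n
drawings from the u distribution which has zero mean and variance $\sigma_u^2$. The
cross-product term vanishes since

$$\begin{aligned} E\{(b - \beta)\bar{u}\} &= E\left\{\left(\sum w_i u_i\right)\left(\frac{1}{n}\sum u_i\right)\right\} \\ &= \frac{1}{n}E\left\{\sum w_i u_i^2 + \sum_{i<j}(w_i + w_j)u_i u_j\right\} \\ &= 0 \end{aligned}$$

The covariance between a and b is

$$\begin{aligned} \operatorname{cov}(a, b) &= E\{(a - \alpha)(b - \beta)\} \\ &= E\{[\bar{u} - (b - \beta)\bar{X}](b - \beta)\} \\ &= -\bar{X}E\{(b - \beta)^2\} \qquad \text{since } E\{(b - \beta)\bar{u}\} = 0 \\ &= -\frac{\bar{X}\sigma_u^2}{\sum x^2} \end{aligned} \tag{2-32}$$

Formulas (2-29), (2-31), and (2-32) all involve the unknown $\sigma_u^2$. To make the
formulas operational, this is replaced by its estimated value $s^2 = \sum e^2/(n - 2)$.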
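The conceptual experiment of repeated sampling with fixed X values can be imitated on a computer and compared with Eqs. (2-27) and (2-29) to (2-32). The following Python sketch is an illustration only: the "true" parameter values, the seed, the number of replications, and the function name `least_squares` are all assumptions made for the demonstration.

```python
import random

X = [2, 3, 1, 5, 9]                        # X values of Table 2-1, held fixed
alpha, beta, sigma_u = 1.0, 1.75, 1.0      # hypothetical "true" values
n = len(X)
x_bar = sum(X) / n
sum_x2 = sum((xi - x_bar) ** 2 for xi in X)

def least_squares(Y):
    """Return (a, b) for the fixed X values, using Eqs. (2-14a) and (2-14b)."""
    y_bar = sum(Y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y)) / sum_x2
    return y_bar - b * x_bar, b

rng = random.Random(2024)                  # fixed seed so the run is repeatable
replications = 20000
a_vals, b_vals = [], []
for _ in range(replications):
    # Only the disturbances change from sample to sample.
    Y = [alpha + beta * xi + rng.gauss(0.0, sigma_u) for xi in X]
    a, b = least_squares(Y)
    a_vals.append(a)
    b_vals.append(b)

mean_a = sum(a_vals) / replications
mean_b = sum(b_vals) / replications
var_a = sum((a - mean_a) ** 2 for a in a_vals) / replications
var_b = sum((b - mean_b) ** 2 for b in b_vals) / replications
cov_ab = sum((a - mean_a) * (b - mean_b)
             for a, b in zip(a_vals, b_vals)) / replications

theory_var_b = sigma_u ** 2 / sum_x2                          # Eq. (2-29)
theory_var_a = sigma_u ** 2 * (1 / n + x_bar ** 2 / sum_x2)   # Eq. (2-31)
theory_cov_ab = -x_bar * sigma_u ** 2 / sum_x2                # Eq. (2-32)

print(f"mean b = {mean_b:.4f} (beta = {beta}), mean a = {mean_a:.4f} (alpha = {alpha})")
print(f"var b = {var_b:.4f} (theory {theory_var_b:.4f})")
print(f"var a = {var_a:.4f} (theory {theory_var_a:.4f})")
print(f"cov(a, b) = {cov_ab:.4f} (theory {theory_cov_ab:.4f})")
```

The simulated moments agree with the formulas only approximately, and the agreement improves as the number of replications grows.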


Example 2-3 From the data in Table 2-2,

$$\sum x^2 = 40, \qquad \sum e^2 = 1.5, \qquad \text{and} \qquad n = 5$$

Thus

$$s^2 = 1.5/3 = 0.5$$

and

$$\operatorname{var}(b) = \frac{0.5}{40} = 0.0125$$

$$\operatorname{var}(a) = 0.5\left[\frac{1}{5} + \frac{16}{40}\right] = 0.3$$

We might use the same data to estimate the sampling variance of the
slope estimated by passing a line through the lowest and highest points. Table
2-1 shows that the smallest X value occurs at the third observation and the
largest at the fifth observation. Thus the alternative slope estimator is

$$\begin{aligned} b' &= \frac{Y_5 - Y_3}{X_5 - X_3} \\ &= \frac{1}{8}(Y_5 - Y_3) \\ &= \frac{1}{8}\left[(\alpha + 9\beta + u_5) - (\alpha + \beta + u_3)\right] \\ &= \beta + \frac{1}{8}(u_5 - u_3) \end{aligned}$$

Hence

$$E(b') = \beta$$

so that the alternative estimator is also unbiased. Its variance is

$$\begin{aligned} \operatorname{var}(b') &= E\{(b' - \beta)^2\} \\ &= E\left\{\frac{1}{64}\left(u_5^2 + u_3^2 - 2u_5u_3\right)\right\} \\ &= \frac{\sigma_u^2}{32} \end{aligned}$$

Replacing $\sigma_u^2$ by the same estimate as in the least-squares case, the estimated
sampling variance is

$$\operatorname{var}(b') = \frac{0.5}{32} = 0.0156$$

which is about 25 percent greater than the variance of the least-squares
estimate. If we form the ratio

$$\frac{\operatorname{var}(b)}{\operatorname{var}(b')} = \frac{0.0125}{0.0156} = 0.8$$

we have the efficiency of b’ relative to b so that, by the least-squares criterion,
b’ is 80 percent efficient.†
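The figures in Example 2-3 follow directly from the formulas, as the following Python sketch shows. It is an illustration only; the variable names are chosen here, and the input numbers are those quoted in the example.

```python
# Quantities quoted in Example 2-3 (from Tables 2-1 and 2-2).
sum_x2 = 40.0      # sum of squared X deviations
sum_e2 = 1.5       # residual sum of squares
n = 5
x_bar = 4.0

s2 = sum_e2 / (n - 2)                            # Eq. (2-16)
var_b = s2 / sum_x2                              # Eq. (2-29) with s2 for sigma_u^2
var_a = s2 * (1 / n + x_bar ** 2 / sum_x2)       # Eq. (2-31)
cov_ab = -x_bar * s2 / sum_x2                    # Eq. (2-32)

# Alternative estimator b' = (Y5 - Y3)/(X5 - X3): two independent disturbances,
# each entering with weight 1/(X5 - X3) in absolute value.
x_low, x_high = 1.0, 9.0
var_b_alt = 2 * s2 / (x_high - x_low) ** 2

efficiency = var_b / var_b_alt

print(f"s^2 = {s2}, var(b) = {var_b}, var(a) = {var_a:.4f}, cov(a, b) = {cov_ab}")
print(f"var(b') = {var_b_alt:.4f}, efficiency of b' = {efficiency:.2f}")
```

Because the same s² appears in the numerator and the denominator, the efficiency ratio does not depend on how the disturbance variance is estimated.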


This last example is an illustration of an important theorem, the Gauss- 
Markov theorem, that the least-squares estimators have minimum variance in the 
class of linear unbiased estimators. We will now prove this theorem just for the 
regression slope and give a proof of the general case in Chap. 5. We already know 
that b is an unbiased estimator. It is also said to be a linear estimator since its 
definition in Eq. (2-24) shows it to be a linear combination of the Y values, and 
hence it is also expressible as a linear combination of the stochastic u variables. 
Define a general linear estimator of B as 

b, = Ley; (2-33) 


where the c, (i = 1,2,..-, m) are some set of weights. Substituting for ¥, from Eq. 


(2-8a) gives 
by = abe, + BLe,X, + Lewy, 


with $E(b_*) = \alpha\sum c_i + \beta\sum c_i X_i$

We require the weights to be such as to make $b_*$ an unbiased estimator of $\beta$. This
imposes the conditions

$\sum c_i = 0$ and $\sum c_i X_i = \sum c_i x_i = 1$    (2-34)

Under these conditions the variance of $b_*$ is

$\operatorname{var}(b_*) = \sigma_u^2 \sum c_i^2$    (2-35)

† The alternative estimator does not provide any means of estimating $\sigma_u^2$, and we have only been
able to estimate var(b') above by using the $s^2$ based on the least-squares residuals. However, this does
not affect the calculation of the efficiency of b' since the proper definition is

Efficiency of $b' = \dfrac{\operatorname{var}(b)}{\operatorname{var}(b')} = \dfrac{\sigma_u^2/\sum x^2}{\sigma_u^2/32} = \dfrac{32}{40} = 0.8$


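To compare this variance with that of the ordinary least-squares (OLS) estimator b, write

$c_i = w_i + (c_i - w_i)$

where $w_i = x_i/\sum x_i^2$ are the least-squares weights. Thus

$\sum c_i^2 = \sum w_i^2 + \sum (c_i - w_i)^2 + 2\sum w_i(c_i - w_i)$    (2-36)

But

$\sum w_i c_i = \dfrac{\sum x_i c_i}{\sum x_i^2} = \dfrac{1}{\sum x_i^2}$ using Eq. (2-34)    (2-37)

and $\sum w_i^2 = 1/\sum x_i^2$ as well. Thus

$\sum w_i(c_i - w_i) = 0$    (2-38)

and so

$\sigma_u^2\sum c_i^2 = \sigma_u^2\sum w_i^2 + \sigma_u^2\sum (c_i - w_i)^2$

which, using Eq. (2-28), gives

$\operatorname{var}(b_*) = \operatorname{var}(b) + \sigma_u^2\sum (c_i - w_i)^2$    (2-39)

Since $\sum (c_i - w_i)^2 > 0$ unless $c_i = w_i$ for all i, this establishes that the OLS
estimator has minimum variance in the class of linear unbiased estimators, and
we write

b is a best linear unbiased estimator (b.l.u.e.) of $\beta$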


The proof of the similar result for the intercept is left as an exercise for the 
reader. 
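The decomposition in Eq. (2-39) can also be checked numerically. The following sketch builds many arbitrary sets of weights satisfying the unbiasedness conditions (2-34) and confirms that each has a sum of squares at least as large as that of the least-squares weights. The X values are an assumption, chosen only to agree with the summary figures quoted in this chapter (n = 5, mean 4, sum of squared deviations 40); the function name is hypothetical.

```python
import random

# Illustrative X values (an assumption: chosen to match the summary figures
# quoted in the text, n = 5, mean 4, sum of squared deviations 40).
X = [2.0, 3.0, 1.0, 5.0, 9.0]
n = len(X)
x_bar = sum(X) / n
x = [xi - x_bar for xi in X]
sum_x2 = sum(v * v for v in x)

# Least-squares weights w_i = x_i / sum(x^2).
w = [v / sum_x2 for v in x]

def random_unbiased_weights(rng):
    """Return c = w + d, where d is orthogonal to both the constant and x,
    so that sum(c) = 0 and sum(c * X) = 1 as Eq. (2-34) requires."""
    d = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d_bar = sum(d) / n
    d = [di - d_bar for di in d]                      # now sum(d) = 0
    slope = sum(di * xi for di, xi in zip(d, x)) / sum_x2
    d = [di - slope * xi for di, xi in zip(d, x)]     # now sum(d * x) = 0
    return [wi + di for wi, di in zip(w, d)]

rng = random.Random(0)
min_sum_w2 = sum(v * v for v in w)   # equals var(b) / sigma^2
for _ in range(1000):
    c = random_unbiased_weights(rng)
    excess = sum((ci - wi) ** 2 for ci, wi in zip(c, w))
    # Eq. (2-39) divided by sigma^2: sum(c^2) = sum(w^2) + sum((c - w)^2)
    assert abs(sum(ci * ci for ci in c) - (min_sum_w2 + excess)) < 1e-9
    # The unbiasedness conditions of Eq. (2-34) hold by construction.
    assert abs(sum(c)) < 1e-9
    assert abs(sum(ci * Xi for ci, Xi in zip(c, X)) - 1.0) < 1e-9
```

Since `excess` is a sum of squares, no admissible set of weights can do better than `w`, which is the content of the theorem for the slope.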

This minimum variance property of least squares is the main reason for the 
widespread use of the technique. It rests on the assumption that the X’s are fixed 
in repeated sampling, that the relationship has been correctly specified in Eq. 
(2-8a), and that the disturbances have zero mean, constant variance, and zero 
covariances. Notice that assumption (2-8d) on the normality of the u distribution
has not been used so far. If we bring this assumption into play, the sampling
distributions of a and b are fully determined. Since a and b are linear functions of
the u's, they in turn are normally distributed variables. Once the mean and the
variance are known, a normal distribution is completely specified. Thus

$a \sim N\left(\alpha,\ \sigma_u^2\left[\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x^2}\right]\right)$    (2-40)

and

$b \sim N\left(\beta,\ \dfrac{\sigma_u^2}{\sum x^2}\right)$    (2-41)


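It still remains to justify the expression given in Eq. (2-16) for the estimator
of the disturbance variance. The ith residual is

$e_i = Y_i - \hat{Y}_i$

$\quad = \alpha + \beta X_i + u_i - a - bX_i$

$\quad = u_i - (a - \alpha) - (b - \beta)X_i$

From the expression for a just before Eq. (2-30),

$a - \alpha = \bar{u} - (b - \beta)\bar{X}$

Thus

$e_i = (u_i - \bar{u}) - (b - \beta)x_i$

Notice that $\bar{u}$, being a sample mean, cannot be set at zero, even though $E(u_i) = 0$.
Squaring and summing,

$\sum e_i^2 = \sum (u_i - \bar{u})^2 + (b - \beta)^2\sum x_i^2 - 2(b - \beta)\sum (u_i - \bar{u})x_i$

$\quad = \sum u_i^2 - n\bar{u}^2 + (b - \beta)^2\sum x_i^2 - 2(b - \beta)\sum u_i x_i$

We note that

$E\left(\sum u_i^2\right) = n\sigma_u^2$

$E(\bar{u}^2) = \operatorname{var}(\bar{u}) = \dfrac{\sigma_u^2}{n}$

$E(b - \beta)^2 = \operatorname{var}(b) = \dfrac{\sigma_u^2}{\sum x_i^2}$

and

$E\left\{(b - \beta)\left(\sum u_i x_i\right)\right\} = E\left\{\left(\sum w_i u_i\right)\left(\sum u_i x_i\right)\right\} = \sigma_u^2\sum w_i x_i = \sigma_u^2$

Thus

$E\left(\sum e_i^2\right) = n\sigma_u^2 - \sigma_u^2 + \sigma_u^2 - 2\sigma_u^2 = (n - 2)\sigma_u^2$

so

$E\left[\dfrac{\sum e_i^2}{n - 2}\right] = \sigma_u^2$

and the estimator proposed in Eq. (2-16) is unbiased. The distribution of this
estimator will be established for the general case in Chap. 5.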
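The unbiasedness of $s^2$ can be illustrated by simulation: hold the X values fixed, draw many samples of disturbances, and average the resulting values of $\sum e^2/(n - 2)$. The sketch below does this with the standard library only. The design, the parameter values, and the function name are all hypothetical choices made for illustration, not taken from the text.

```python
import random

def simulate_s2(X, alpha, beta, sigma, reps, seed=0):
    """Average of s^2 = sum(e^2) / (n - 2) over repeated samples with X held fixed."""
    rng = random.Random(seed)
    n = len(X)
    x_bar = sum(X) / n
    x = [xi - x_bar for xi in X]
    sum_x2 = sum(v * v for v in x)
    total = 0.0
    for _ in range(reps):
        # Generate Y from the assumed model with normal disturbances.
        Y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in X]
        y_bar = sum(Y) / n
        # Least-squares slope and intercept.
        b = sum(xd * (yi - y_bar) for xd, yi in zip(x, Y)) / sum_x2
        a = y_bar - b * x_bar
        # Residual sum of squares divided by its degrees of freedom.
        rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(X, Y))
        total += rss / (n - 2)
    return total / reps

# Hypothetical design and parameter values, chosen only for illustration.
mean_s2 = simulate_s2(X=[2.0, 3.0, 1.0, 5.0, 9.0], alpha=1.0, beta=1.75,
                      sigma=1.0, reps=20000)
print(mean_s2)   # should be close to sigma**2 = 1.0
```

Dividing by n instead of n - 2 in the same experiment would give an average visibly below $\sigma_u^2$ for a sample this small.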


2-5 INFERENCE IN THE LEAST-SQUARES MODEL 


In the previous section we established the sampling distributions of a and b under
assumptions (2-8a) to (2-8d). Results (2-40) and (2-41), however, involve the
unknown $\sigma_u^2$, and so they are not operational as they stand. To derive the
sampling distributions of a and b when $\sigma_u^2$ is replaced by its estimator $s^2$, we
merely state here two results which will be proven later in Chap. 5, namely, that
under the assumptions made so far,

$\dfrac{\sum e^2}{\sigma_u^2} \sim \chi^2(n - 2)$    (2-42)

and

$\sum e^2$ is distributed independently of f(a, b)


Concentrating first of all on inferences about $\beta$ and recalling that the t
distribution is given by the ratio of a standard normal variable to the square root
of a $\chi^2$ variable divided by its degrees of freedom,† we have from Eq. (2-41)

$\dfrac{(b - \beta)\sqrt{\sum x^2}}{\sigma_u} \sim N(0, 1)$

so

$t = \dfrac{(b - \beta)\sqrt{\sum x^2}/\sigma_u}{\sqrt{\sum e^2/\sigma_u^2(n - 2)}} = \dfrac{b - \beta}{s/\sqrt{\sum x^2}} \sim t(n - 2)$    (2-43)

Replacing $\sigma_u^2$ by its estimator $s^2$ shifts us from a normal distribution to a t
distribution. As the sample size becomes very large, the t distribution tends
toward the standard normal distribution, so for sample sizes in excess of 30 or so
no great harm is done in treating $(b - \beta)\sqrt{\sum x^2}/s$ as if it were a standard normal
variable.

The standard inference procedures based on Eq. (2-43) are then as follows. A
95 percent confidence interval for $\beta$ is given by

$b \pm t_{0.025}\, s/\sqrt{\sum x^2}$    (2-44)

where $\sqrt{\sum x^2}$ is computed from the sample data, b from Eq. (2-14a), s from Eq.
(2-16), and $t_{0.025}$ is read off as the 2.5 percent point of the t distribution with n − 2
degrees of freedom. In general a 100(1 − ε) percent confidence interval for $\beta$ is
given by

$b \pm t_{\epsilon/2}\, s/\sqrt{\sum x^2}$    (2-45)

† See App. A-7 on the $\chi^2$, t, and F distributions.
To test the null hypothesis that $\beta$ has some specified value $\beta_0$, that is,

$H_0$: $\beta = \beta_0$

against the alternative hypothesis that $\beta$ has some value other than $\beta_0$, that is,

$H_1$: $\beta \neq \beta_0$

we insert $\beta_0$ in Eq. (2-43) and then have the conditional statement: if the null
hypothesis is true,

$\dfrac{b - \beta_0}{s/\sqrt{\sum x^2}} \sim t(n - 2)$

This gives the sampling distribution of b under the null hypothesis, as shown in
Fig. 2-5. If the null hypothesis were true, 95 percent of all sample values of b
would lie within $t_{0.025}$ standard errors of $\beta_0$, that is, inside the symmetrical region
about $\beta_0$ shown in the figure. If our sample b is found in either tail of the
distribution, then either

1. The null hypothesis is true and an unlikely event has occurred
or
2. The null hypothesis is not true

In such a case we deliberately choose the second interpretation and thus follow
this procedure. Reject $H_0$ at the 5 percent level of significance if

$\left|\dfrac{b - \beta_0}{s/\sqrt{\sum x^2}}\right| > t_{0.025}$

Accept $H_0$ at the 5 percent level of significance if

$\left|\dfrac{b - \beta_0}{s/\sqrt{\sum x^2}}\right| \le t_{0.025}$

Figure 2-5 Sampling distribution of b under $H_0$: $\beta = \beta_0$ (horizontal axis marked at $\beta_0 - t_{0.025}\,s/\sqrt{\sum x^2}$, $\beta_0$, and $\beta_0 + t_{0.025}\,s/\sqrt{\sum x^2}$).

In general the procedure is as follows. Reject $H_0$ at the 100ε percent level of
significance if

$\left|\dfrac{b - \beta_0}{s/\sqrt{\sum x^2}}\right| > t_{\epsilon/2}$


The null hypothesis most frequently tested is

$H_0$: $\beta = 0$

This is referred to as testing the significance of X. If the hypothesis is true, the X
variable plays no role in the determination of Y. Nonetheless when sample values
of X and Y are drawn from such a population and the least-squares formula is
applied, we will usually find nonzero values for b. These can arise from the
fluctuations of random sampling even though the underlying $\beta$ is truly zero.
The appropriate significance test follows directly from replacing $\beta_0$ by zero in the
results above. Thus reject $H_0$: $\beta = 0$ at the 100ε percent level of significance if

$\left|\dfrac{b}{s/\sqrt{\sum x^2}}\right| > t_{\epsilon/2}$

The test statistic is now simply the ratio of b to its estimated standard error. Most
computer programs for regression analysis print out the value of b and either its
estimated standard error or else the ratio of b to its estimated standard error, and
this is often referred to as the sample t statistic.† If the sample t statistic is
numerically greater than the preselected critical value of t, we accept the alternative
hypothesis and conclude that X plays a significant role in the determination
of Y. The presentation of sample t statistics is thus directly useful for making
significance tests. If, however, one wishes to test a null hypothesis that $\beta$ has some
value other than zero, one requires the standard error (s.e.) for substitution in the
appropriate test statistic. This can be obtained from the computer printout as

s.e.(b) $= \dfrac{b}{\text{sample } t \text{ statistic}}$


† In the rest of the text we will normally drop the distinction between the true standard error and
the estimated standard error, as it is usually clear from the context which concept is implied.

By a similar development, tests on the intercept are based on the t distribution:

$t = \dfrac{a - \alpha}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x^2}}} \sim t(n - 2)$    (2-46)

Thus a 100(1 − ε) percent confidence interval for $\alpha$ is given by

$a \pm t_{\epsilon/2}\, s\sqrt{\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x^2}}$    (2-47)

and the hypothesis

$H_0$: $\alpha = \alpha_0$

would be rejected at the 100ε percent level of significance if

$\left|\dfrac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x^2}}}\right| > t_{\epsilon/2}$
Tests on $\sigma_u^2$ may be derived from the result stated in Eq. (2-42). Using that
result one may, for example, write

$\Pr\left\{\chi^2_{0.025} < \dfrac{(n - 2)s^2}{\sigma_u^2} < \chi^2_{0.975}\right\} = 0.95$    (2-48)

which merely states that 95 percent of the values of a $\chi^2$ variable will lie between
the values that cut off 2.5 percent in each tail of the distribution. This is illustrated
in Fig. 2-6. The critical values are read off from the $\chi^2$ distribution with n − 2
degrees of freedom. The only unknown in Eq. (2-48) is $\sigma_u^2$, and the contents of the
probability statement may be rearranged to give a 95 percent confidence interval
for $\sigma_u^2$ as

$\dfrac{(n - 2)s^2}{\chi^2_{0.975}}$ to $\dfrac{(n - 2)s^2}{\chi^2_{0.025}}$

Figure 2-6 (density $p(\chi^2)$ plotted against $\chi^2$)


Example 2-4 From the data of Tables 2-1 and 2-2 we have already computed

a = 1       var(a) = 0.3
b = 1.75    var(b) = 0.0125

Thus

s.e.(a) $= \sqrt{0.3} = 0.5477$
s.e.(b) $= \sqrt{0.0125} = 0.1118$

Since n = 5, from the t distribution with 3 degrees of freedom,

$t_{0.025} = 3.182$

Thus a 95 percent confidence interval for $\alpha$ is

1 ± 3.182(0.5477)

that is,

−0.74 to 2.74

and a 95 percent confidence interval for $\beta$ is

1.75 ± 3.182(0.1118)

that is,

1.39 to 2.11

The intercept is not significantly different from zero since

$\dfrac{a}{\text{s.e.}(a)} = \dfrac{1}{0.5477} = 1.826 < 3.182$

while the slope is strongly significant since

$\dfrac{b}{\text{s.e.}(b)} = \dfrac{1.75}{0.1118} = 15.653 > 3.182$

Once confidence intervals have been computed, there is no need to actually
compute the significance tests, since a confidence interval which includes zero
is equivalent to accepting the hypothesis that the true value of the parameter
is zero, and an interval which does not embrace zero is equivalent to rejecting
the null hypothesis.

From the $\chi^2$ distribution with 3 degrees of freedom $\chi^2_{0.025} = 0.216$ and
$\chi^2_{0.975} = 9.35$. We also have $\sum e^2 = (n - 2)s^2 = 1.5$. Thus a 95 percent
confidence interval for $\sigma_u^2$ is

$\dfrac{1.5}{9.35}$ to $\dfrac{1.5}{0.216}$

that is,

0.16 to 6.94
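The calculations of Example 2-4 can be retraced as follows. This is a sketch using only the figures quoted in the example; the critical values are entered as constants from the tables rather than computed, and the helper name `t_interval` is illustrative.

```python
import math

# Figures quoted in Example 2-4; the critical values are the tabulated ones
# given in the text for 3 degrees of freedom.
a, var_a = 1.0, 0.3
b, var_b = 1.75, 0.0125
rss = 1.5                        # sum of squared residuals, (n - 2) * s^2
t_crit = 3.182                   # t_{0.025} with 3 d.f.
chi2_lo, chi2_hi = 0.216, 9.35   # chi-square 2.5 and 97.5 percent points, 3 d.f.

def t_interval(estimate, variance, t_value):
    """Symmetric interval: estimate +/- t_value * standard error."""
    half_width = t_value * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

ci_alpha = t_interval(a, var_a, t_crit)
ci_beta = t_interval(b, var_b, t_crit)
# Interval for the disturbance variance, rearranged from Eq. (2-48).
ci_sigma2 = (rss / chi2_hi, rss / chi2_lo)

# Sample t statistics for the hypotheses alpha = 0 and beta = 0.
t_a = a / math.sqrt(var_a)
t_b = b / math.sqrt(var_b)
print(ci_alpha, ci_beta, ci_sigma2, t_a, t_b)
```

The interval for $\alpha$ contains zero and the interval for $\beta$ does not, which agrees with the two significance tests.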


2-6 ANALYSIS OF VARIANCE IN LEAST-SQUARES REGRESSION 


The test for the significance of X ($H_0$: $\beta = 0$), derived in the previous section,
may also be set out in an analysis of variance framework, and this alternative
approach will be especially helpful when we treat problems of multiple regression
in Chap. 5.

From Eq. (2-41) we have the result

$\dfrac{b - \beta}{\sigma_u/\sqrt{\sum x^2}} \sim N(0, 1)$

From the definition of the $\chi^2$ variable in App. A-7 we then have

$\dfrac{(b - \beta)^2\sum x^2}{\sigma_u^2} \sim \chi^2(1)$

and since

$\dfrac{\sum e^2}{\sigma_u^2} \sim \chi^2(n - 2)$

independently of b,

$F = \dfrac{(b - \beta)^2\sum x^2}{\sum e^2/(n - 2)} \sim F(1, n - 2)$    (2-49)

recalling that F is the ratio of two independent $\chi^2$ variables, each divided by the
number of its degrees of freedom. If $\beta = 0$,

$F = \dfrac{b^2\sum x^2}{\sum e^2/(n - 2)} \sim F(1, n - 2)$    (2-50)

Referring to the decomposition of the sum of squares in Eq. (2-15), the F statistic
in Eq. (2-50) is seen to be

$F = \dfrac{\text{ESS}/1}{\text{RSS}/(n - 2)}$    (2-51)


Following this approach, the data are set out in an analysis of variance (ANOVA)
table (Table 2-3). The entries in cols. (ii) and (iii) of the table are additive. The
mean squares in the final column are obtained by dividing the sum of squares in
each row by the corresponding number of degrees of freedom. An intuitive
explanation of the degrees of freedom concept is that it is equal to the number of
values that may be set arbitrarily. Thus we may set n − 1 values of y at will, but
the nth is then determined by the condition that $\sum y = 0$. Likewise, we may set
n − 2 values of e at will, but the least-squares fit imposes two conditions on e,
namely, $\sum e = \sum xe = 0$, and finally there is only 1 degree of freedom attached to
the explained sum of squares since that depends only on a single parameter, $\beta$.

Table 2-3 ANOVA for two-variable regression

Source of                                                      Degrees of
variation    Sum of squares                                    freedom      Mean square
(i)          (ii)                                              (iii)        (iv)
X            ESS $= \sum \hat{y}^2 = b^2\sum x^2 = b\sum xy$   1            ESS/1
Residual     RSS $= \sum e^2$                                  n − 2        RSS/(n − 2)
Total        TSS $= \sum y^2$                                  n − 1

The F statistic in Eq. (2-51) is seen to be the ratio of the mean square due to 
X to the residual mean square. The latter may be regarded as a measure of the 
“noise” in the system, and thus an X effect is only detected if it is greater than 
the inherent noise level. The significance of X is thus tested by examining whether 
the sample F exceeds the appropriate critical value of F taken from the upper tail 
of the F distribution. Thus the test procedure is as follows. Reject $H_0$: $\beta = 0$ at
the 5 percent level of significance if

$\dfrac{\text{ESS}/1}{\text{RSS}/(n - 2)} > F_{0.95}(1, n - 2)$

where $F_{0.95}$ indicates the value of F such that just 5 percent of the distribution lies
to the right of the ordinate at $F_{0.95}$. Other levels of significance may be handled in
a similar fashion.


Example 2-5 Table 2-4 shows the analysis of variance for the data of Table
2-2.

Table 2-4

Source of      Sum of      Degrees of
variation      squares     freedom       Mean square
X              122.5       1             122.5
Residual       1.5         3             0.5
Total          124         4

The sample F statistic is

Sample $F = \dfrac{122.5}{0.5} = 245.0$

and $F_{0.95}(1, 3) = 10.1$. Thus we reject $H_0$: $\beta = 0$.

The analysis of variance test is merely the significance test on $\beta$ in another
guise, but it is useful to have introduced the procedure here, as it will be applied
extensively in later work. To establish the equivalence, recall that the t test for the
significance of $\beta$ is as follows. Reject $H_0$: $\beta = 0$ at the 5 percent level of
significance if

$\left|\dfrac{b}{s/\sqrt{\sum x^2}}\right| > t_{0.025}(n - 2)$

From Eq. (2-50) the F test procedure is as follows. Reject $H_0$: $\beta = 0$ at the 5
percent level of significance if

$\dfrac{b^2\sum x^2}{\sum e^2/(n - 2)} > F_{0.95}(1, n - 2)$

The second test statistic is seen to be the square of the first. Thus

Sample F = (sample t)$^2$

It is also shown in App. A-7 that the F variable with (1, r) degrees of freedom is
the square of a t variable with r degrees of freedom. Thus

Critical F = (critical t)$^2$

and the two tests are completely equivalent.
Finally, there is another way of looking at the F test that will also be helpful
later. It was shown in Eq. (2-20) that

ESS $= r^2\,$TSS $= r^2\sum y^2$

and

RSS $= (1 - r^2)\,$TSS $= (1 - r^2)\sum y^2$

Thus the sample F statistic may be written†

$F = \dfrac{r^2/1}{(1 - r^2)/(n - 2)}$    (2-52)

Again, exploiting the relation between the t and F distributions, Eq. (2-52) gives

$t = \dfrac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$    (2-53)

and this statistic may be referred to the t distribution with n − 2 degrees of
freedom to test the significance of the relationship between Y and X. From
Example 2-2 we have

$r^2 = \dfrac{\text{ESS}}{\text{TSS}} = \dfrac{122.5}{124.0} = 0.9879$

giving r = 0.9939. Substituting in Eq. (2-53) gives

$t = \dfrac{0.9939\sqrt{3}}{\sqrt{1 - 0.9879}} = 15.65$

From the data given in Examples 2-2 and 2-3

$t = \dfrac{b}{\text{s.e.}(b)} = \dfrac{1.75}{\sqrt{0.0125}} = 15.65$

and the square root of the sample F statistic in Example 2-5 is

$t = \sqrt{F} = \sqrt{245.0} = 15.65$

† It is, of course, superfluous to insert unity as the divisor of the numerator in both Eqs. (2-51) and
(2-52), but it maintains a correspondence between these expressions in models where there is only one
explanatory variable and later models with several explanatory variables.
Thus all three tests are simply three versions of a single test. The test involving r 
may be regarded as a test of the significance of the correlation coefficient, that 
based on b as a test of the significance of the regression slope, and the analysis of 
variance formulation tests the significance of the explained sum of squares, but 
they are just three different ways of essentially asking the same question. 
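The numerical agreement of the three statistics can be confirmed directly from the sums of squares. The sketch below uses only figures quoted in Examples 2-4 and 2-5; the variable names are illustrative.

```python
import math

n = 5
ess, rss = 122.5, 1.5          # explained and residual sums of squares (Table 2-4)
b, var_b = 1.75, 0.0125        # slope and its estimated variance

f_stat = (ess / 1) / (rss / (n - 2))               # Eq. (2-51)
t_from_b = b / math.sqrt(var_b)                    # slope over its standard error
r2 = ess / (ess + rss)                             # squared correlation, ESS / TSS
t_from_r = math.sqrt(r2 * (n - 2) / (1.0 - r2))    # Eq. (2-53), taking r > 0
print(f_stat, t_from_b, t_from_r, math.sqrt(f_stat))
```

All three routes give the same value because each is an algebraic rearrangement of the ratio ESS to RSS/(n − 2).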


2-7 PREDICTION IN THE LEAST-SQUARES MODEL 


Suppose we have a set of sample observations $X_i$, $Y_i$ ($i = 1, \ldots, n$) to which we
have applied the least-squares techniques of the previous sections. In addition we
suppose that our interest now focuses on some specific value of the independent
variable $X_0$, and we are required to forecast, or predict, the value $Y_0$ likely to be
associated with $X_0$. For instance, if Y denoted the consumption of gasoline and X
the price of gasoline, we might be interested in predicting the demand for gasoline
at some higher future price. The value of $X_0$ may lie within the range of sample X
values, or, more frequently, we may be concerned with predicting Y for a value of 
X outside the sample observations. In either case the prediction involves the 
assumption that the relationship presumed to have generated the sample data still 
holds for the new observation, whether it relates to a future time period or to a 
unit that was not included in a sample cross section. Alternatively, we may have a 
new observation ($X_0$, $Y_0$), and the question arises whether this observation may
be presumed to have come from the same population that generated the sample 
data. For example, an appeal might be made to motorists on grounds of 
patriotism to reduce their consumption of gasoline, and we may use the new 
observation to test whether the appeal is having any effect. Prediction theory 
enables us to perform both tasks. We may make two kinds of predictions, a point 
prediction or an interval prediction, in just the same way as we can give a point 
estimate or a confidence interval estimate of a parameter $\beta$. But in practice, a
point estimate is of little use without some indication of its precision, so one
should always provide an estimate of the prediction error. The point prediction is
given by the regression value corresponding to $X_0$, that is,

$\hat{Y}_0 = a + bX_0$    (2-54)

The true value of Y in the prediction period is given by

$Y_0 = \alpha + \beta X_0 + u_0$

where $u_0$ indicates the value that would be drawn from the disturbance distribution
in the prediction period. The prediction error may then be defined as

$e_0 = Y_0 - \hat{Y}_0$

$\quad = u_0 - (a - \alpha) - (b - \beta)X_0$    (2-55)

Taking expectations

$E(e_0) = 0$

since $E(u_0) = 0$ and a and b are unbiased estimators of $\alpha$ and $\beta$. Thus the
least-squares predictor, Eq. (2-54), is an unbiased predictor. The variance of the
prediction error is then found by squaring Eq. (2-55) and taking expectations:

$\operatorname{var}(e_0) = E(e_0^2)$

$\quad = \operatorname{var}(u_0) + \operatorname{var}(a) + X_0^2\operatorname{var}(b) + 2X_0\operatorname{cov}(a, b)$

since the other two covariances vanish.† On substitution from Eqs. (2-28), (2-31),
and (2-32) this gives

$\operatorname{var}(e_0) = \sigma_u^2\left[1 + \dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x^2} + \dfrac{X_0^2}{\sum x^2} - \dfrac{2X_0\bar{X}}{\sum x^2}\right]$

$\quad = \sigma_u^2\left[1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum x^2}\right]$    (2-56)

The variance of the prediction error is thus at its minimum value when $X_0 = \bar{X}$
and increases nonlinearly as $X_0$ departs from $\bar{X}$. From Eq. (2-55) $e_0$ is seen to be a
linear function of normal variables and so is itself distributed normally. Thus

$\dfrac{e_0}{\sigma_u\sqrt{1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum x^2}}} \sim N(0, 1)$

Replacing the unknown $\sigma_u$ by its estimate $s = \sqrt{\sum e^2/(n - 2)}$ then gives

$\dfrac{Y_0 - \hat{Y}_0}{s\sqrt{1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum x^2}}} \sim t(n - 2)$    (2-57)

Everything in Eq. (2-57) is known except $Y_0$, and so, in the usual way, we derive a
95 percent confidence interval for $Y_0$ as

$(a + bX_0) \pm t_{0.025}\, s\sqrt{1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum x^2}}$    (2-58)

† By assumption $u_0$ is independent of $u_1, u_2, \ldots, u_n$, and thus it has zero covariance with $(a - \alpha)$
and $(b - \beta)$, since these are each linear functions of $u_1, u_2, \ldots, u_n$.


Sometimes interest centers on predicting the mean value of $Y_0$, that is,

$E(Y_0) = \alpha + \beta X_0$

rather than $Y_0$ itself, since there is, of course, no way of predicting the value of a
single drawing from p(u). The prediction error is now

$e_0 = E(Y_0) - \hat{Y}_0$

$\quad = -(a - \alpha) - (b - \beta)X_0$

which gives

$\operatorname{var}(e_0) = \sigma_u^2\left[\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum x^2}\right]$

and so a 95 percent confidence interval for $E(Y_0)$ is

$a + bX_0 \pm t_{0.025}\, s\sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum x^2}}$    (2-59)

The width of the confidence interval in Eqs. (2-58) and (2-59) is seen to increase
symmetrically the further $X_0$ is from the sample mean $\bar{X}$, as shown in Fig. 2-7.


Figure 2-7 (regression line a + bX plotted against X)

Example 2-6 Assembling relevant results for the data of Table 2-1,

$\hat{Y} = 1 + 1.75X$

n = 5

$\bar{X} = 4$

$s^2 = \dfrac{\sum e^2}{n - 2} = \dfrac{1.5}{3} = 0.5$

$\sum x^2 = 40$

Suppose we require a 95 percent confidence interval for Y given X = 10.
Applying Eq. (2-58) gives

$1 + 1.75(10) \pm 3.182\sqrt{0.5}\sqrt{1 + \dfrac{1}{5} + \dfrac{(10 - 4)^2}{40}}$

that is,

18.5 ± 3.26
or 15.24 to 21.76

The 95 percent interval for E(Y|X = 10) is

$18.5 \pm 3.182\sqrt{0.5}\sqrt{\dfrac{1}{5} + \dfrac{(10 - 4)^2}{40}}$

that is,

18.5 ± 2.36
or 16.14 to 20.86

To test whether a new observation (Xo, Yo) may be thought to come from the 
structure generating the sample data, one merely contrasts the observation 
with the confidence interval for Yo. For example, the point (10, 25) gives a Y 
value which lies outside the interval 15.24 to 21.76, and one would conclude 
that it was unlikely to have been generated by the same structure as the 


sample data. 
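The two intervals of Example 2-6 differ only in the extra unity inside the square root of Eq. (2-58). A minimal sketch, using the figures assembled in the example and a hypothetical helper name, makes the comparison explicit:

```python
import math

def prediction_half_widths(x0, n, x_bar, sum_x2, s2, t_value):
    """Half-widths of the intervals in Eqs. (2-58) and (2-59) at X = x0."""
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sum_x2
    for_single_y = t_value * math.sqrt(s2 * (1.0 + leverage))   # Eq. (2-58)
    for_mean_y = t_value * math.sqrt(s2 * leverage)             # Eq. (2-59)
    return for_single_y, for_mean_y

a, b = 1.0, 1.75      # least-squares intercept and slope
x0 = 10.0             # value of X at which the prediction is required
point = a + b * x0    # point prediction, Eq. (2-54)
hw_y, hw_mean = prediction_half_widths(x0, n=5, x_bar=4.0, sum_x2=40.0,
                                       s2=0.5, t_value=3.182)
print(point, round(hw_y, 2), round(hw_mean, 2))

# A new observation is judged against the interval for Y itself.
new_y = 25.0
outside = abs(new_y - point) > hw_y
```

Calling the function over a range of `x0` values would trace out the widening bands around the regression line described in connection with Fig. 2-7.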


PROBLEMS 


2-1 The least-squares estimate of $\alpha$ in $Y = \alpha + \beta X + u$ is $a = \sum (1/n - \bar{X}w_i)Y_i$, where $w_i = x_i/\sum x_i^2$
with $x_i = X_i - \bar{X}$ and

$\operatorname{var}(a) = \sigma_u^2\left[\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x^2}\right]$

Show that no other linear unbiased estimate of $\alpha$ can be constructed with
a smaller variance.


2-2 Show that if $z_i$ are independent quantities from the same population, with variance $\sigma^2$, then the
sampling variance of

$b = \sum_{i=1}^{n} a_i z_i$

is $\sigma^2\sum_{i=1}^{n} a_i^2$. Observations $Y_i$ are related to fixed quantities $X_i$ and the quantities $z_i$ above by the
relations $Y_i = \alpha + \beta X_i + z_i$ ($i = 1, \ldots, n$). If the values of $X_i$ are

$X_1$  $X_2$  $X_3$  $X_4$  $X_5$  $X_6$
1      2      3      4      5      6
an alternative estimate of B is 
(% + %- %-%) 


Deduce the sampling variance of this estimate and compare it with the sampling variance of the
least-squares estimate.
(Oxford University, 1958) 


2-3 From a sample of 200 pairs of observations the following quantities were calculated: 


$\sum X = 11.34$, $\sum Y = 20.72$, $\sum X^2 = 12.16$, $\sum Y^2 = 84.96$, $\sum XY = 22.13$
Estimate the regressions $Y = \alpha + \beta X$ and $X = \gamma + \delta Y$.
(R.S.S. Certificate, 1956)
2-4 Show that if r is the correlation coefficient between n pairs of values ($X_i$, $Y_i$), then the correlation
coefficient between the n pairs ($aX_i + b$, $cY_i + d$), where a, b, c, and d are constants, is also r.
(R.S.S. Certificate, 1956)
2-5 The percentage of fat X and the percentage of nonfat solids Y are measured on milk samples of a 
number of dairy cows in two herds. A summary of the data is set out below. Calculate the linear 
regression of Y on X for each herd, and test whether the two lines differ in slope. 
Herd A, number of cows = 16: 


$\sum X = 51.13$, $\sum Y = 117.25$, $\sum x^2 = 1.27$, $\sum y^2 = 4.78$, $\sum xy = 1.84$
Herd B, number of cows = 10:

$\sum X = 37.20$, $\sum Y = 78.75$, $\sum x^2 = 1.03$, $\sum y^2 = 2.48$, $\sum xy = 1.10$

(R.S.S. Certificate, 1956)
[Note: If $b_1$ is $N(\beta_1, \sigma_1^2/\sum x_1^2)$ and $b_2$ is $N(\beta_2, \sigma_2^2/\sum x_2^2)$, where $b_1$ and $b_2$ are independent, then
$b_1 - b_2$ is $N(\beta_1 - \beta_2, \sigma_1^2/\sum x_1^2 + \sigma_2^2/\sum x_2^2)$. If $\sigma_1^2$ and $\sigma_2^2$ are unknown, then a shift to the t
distribution can be made if we assume $\sigma_1^2 = \sigma_2^2 = \sigma^2$ and pool the sum of squared residuals from each
regression so that $(\sum e_1^2 + \sum e_2^2)/\sigma^2$ has a $\chi^2$ distribution with $n_1 + n_2 - 4$ degrees of freedom.]
2-6 Data on aggregate income Y and consumption C yield the following regressions, expressed in 
deviation form: 
y = 1.2c
c = 0.6y
If Y = C + Z (where Z is savings), compute the correlation between Y and Z, the correlation between
C and Z, and the ratio of the standard deviations of Z and Y.
(R.S.S. Certificate, 1948)


2-7 The table below gives the means and the standard deviations of two variables X and Y and the 
correlation between them for each of two samples. 


          Number
Sample    in sample    $\bar{X}$    $\bar{Y}$    $s_x$    $s_y$    $r_{xy}$
1         600          5            12           2        3        0.6
2         400          7            10           3        4        0.7

Calculate the correlation between X and Y for the composite sample consisting of the two samples
taken together. Comment on the fact that this correlation is lower than either of the two original
values.
(R.S.S. Certificate, 1955)


2-8 An investigator is interested in the two following series: 


Year                                             1935  '36  '37  '38  '39  '40  '41  '42  '43  '44  '45  '46
X, deaths of children under 1 year, thousands     60    62   61   55   53   60   63   53   52   48   49   43
Y, consumption of beer, bulk barrels              23    23   25   25   26   26   29   30   30   32   33   31

(a) Calculate the coefficient of correlation between X and Y.

(b) A linear time trend may be fitted to X (or Y) by calculating an OLS regression of X (or Y) on
time t. This requires choosing an origin and a unit of measurement for time. For example, if the origin
is set at mid-1935 and the unit of measurement is 1 year, then the year 1942 corresponds to t = 7. If
the origin is set at end-1940 (beginning of 1941) and the unit of measurement is 6 months, then 1937
corresponds to t = −7. Show that any computed trend value $\hat{X}_t = a + bt$ is unaffected by the choice
of origin and the unit of measurement.

(c) Let $\tilde{X}$ be X with any time trend removed; that is, $\tilde{X}_t = X_t - \hat{X}_t$. Calculate the correlation
between $\tilde{X}$ and Y, and between $\tilde{X}$ and $\tilde{Y}$. Compare these values with that obtained in part (a), and
comment on the difference.

(R.S.S. Certificate, 1954)


2-9 A sample of 20 observations corresponding to the regression model 
$Y = \alpha + \beta X + \epsilon$
where $\epsilon$ is normal with zero mean and unknown variance $\sigma^2$, gave the following data:
$\sum Y = 21.9$, $\sum (Y - \bar{Y})^2 = 86.9$, $\sum (X - \bar{X})(Y - \bar{Y}) = 106.4$,
$\sum X = 186.2$, $\sum (X - \bar{X})^2 = 215.4$
Estimate $\alpha$ and $\beta$ and calculate estimates of variance of your estimates. Estimate the (conditional)
mean value of Y corresponding to a value of X fixed at X = 10 and find a 95 percent confidence
interval for this (conditional) mean.
(UL, 1958)
2-10 Consider the regression without an intercept $Y_i = \beta X_i + u_i$ ($i = 1, \ldots, n$) for which all standard
assumptions hold. Suppose we want to predict $Y_0$. Find a predictor $\hat{Y}_0$ such that $E(\hat{Y}_0) = E(Y_0)$.
(University of Michigan, 1981)


2-11 If the sample values of X in the linear model 
$Y_i = \alpha + \beta X_i + u_i$

have zero mean, show that the covariance of the least-squares estimates of $\alpha$ and $\beta$ is zero. Hence, or
otherwise, prove that an unbiased estimator of $\beta$ can be derived by estimating the equation

$Y_i = \beta X_i$

which is constrained to pass through the origin. What is the variance of this estimator of $\beta$?
(UL, 1973)


CHAPTER 


THREE 


EXTENSIONS OF THE 
TWO-VARIABLE LINEAR MODEL 


The obvious limitations of the two-variable linear model are that it is linear and 
embraces only two variables. These restrictions limit the variety of statistical 
phenomena for which it provides an adequate description. In this chapter we 
describe some of the more important ways of extending the range of the model. 
First of all we discuss the case of replicated observations for various X values, 
which enables us to test the adequacy of the linear representation against the 
alternative hypothesis that the relationship of Y to X is nonlinear. Then we discuss
various types of nonlinearity and the ways in which they may be handled. In 
some cases suitable transformations of the variables return the problem to a linear 
framework, in which case the simple techniques of Chap. 2 may be applied. In 
others, nonlinear relations have to be fitted directly, but the discussion of 
nonlinear estimation is beyond the scope of this book. Finally, an introduction to 
three-variable regression is provided, preparing the way for a general treatment of 
multiple regression in Chap. 5.


3-1 REPEATED OBSERVATIONS AND A TEST OF LINEARITY 


Suppose now that we have sample data which could be arranged in the form 
shown schematically in Table 3-1. 


Table 3-1 Sample data with replication

X        Y                                      Mean $\bar{Y}_i$
$X_1$    $Y_{11}\ Y_{12}\ \cdots\ Y_{1m}$       $\bar{Y}_1$
$X_2$    $Y_{21}\ Y_{22}\ \cdots\ Y_{2m}$       $\bar{Y}_2$
...
$X_i$    $Y_{i1}\ Y_{i2}\ \cdots\ Y_{im}$       $\bar{Y}_i$
...
$X_p$    $Y_{p1}\ Y_{p2}\ \cdots\ Y_{pm}$       $\bar{Y}_p$

Here we have p distinct values of X and m observations on Y corresponding
to each X observation, giving n = mp sample observations altogether. This
situation is common in experimental design work: X, for example, might indicate
the amount of fertilizer applied to standard sized plots and Y the yield per plot of
some crop; or X might indicate drug dosage and Y the response of a patient.
Since no two plots (or patients) are completely identical, they may be expected to
display varied responses to any given level of X, and so repeated observations on
Y are desirable. We have assumed in the layout of the table that we have the same
number of replications m for each value of X. This is a simplification to keep the
formulas as uncomplicated as possible. The various formulas can easily be
amended to deal with the general case where there are $m_i$ observations on Y
corresponding to $X_i$ and a total sample size of

$n = \sum_{i=1}^{p} m_i$


The first step in the analysis of replicated data is to compute the row means

$\bar{Y}_i = \dfrac{1}{m}\sum_{j=1}^{m} Y_{ij}$    $i = 1, \ldots, p$    (3-1)

where $Y_{ij}$ denotes the jth observation in the ith row or class. The scatter diagram
might look like Fig. 3-1 for the case of m = 4 and p = 6, with the circles
indicating individual observations and the black squares the sample mean values
of Y. Clearly, we can fit a least-squares regression to this scatter of 24 observations
by the methods of Chap. 2. In this case, however, that turns out to be
identical to the regression fitted to the six mean points shown on the scatter.†

Figure 3-1 (scatter of Y against X at $X_1, \ldots, X_6$, with the fitted line $\hat{Y} = a + bX$)

† This result does not hold when there are unequal numbers of observations in each class. See Eq.
(3-6).

To see that this is so, consider the formula for the regression slope

$b = \dfrac{\sum xy}{\sum x^2}$

The deviations in this formula are measured from the overall sample means,
which we may now define as

$\bar{Y} = \dfrac{1}{mp}\sum_{i=1}^{p}\sum_{j=1}^{m} Y_{ij}$

or, using Eq. (3-1),

$\bar{Y} = \dfrac{1}{p}\sum_{i=1}^{p} \bar{Y}_i$    (3-2)

It is convenient to denote the X observations by $X_{ij}$, with the proviso that
$X_{i1} = X_{i2} = \cdots = X_{im} = X_i$ for $i = 1, 2, \ldots, p$. Thus

$\bar{X} = \dfrac{1}{mp}\sum_{i=1}^{p}\sum_{j=1}^{m} X_{ij} = \dfrac{1}{p}\sum_{i=1}^{p} X_i$

Then

$\sum x^2 = \sum_{i=1}^{p}\sum_{j=1}^{m} (X_{ij} - \bar{X})^2 = m\sum_{i=1}^{p} (X_i - \bar{X})^2$    (3-3)

and

$\sum xy = \sum_{i=1}^{p}\sum_{j=1}^{m} (X_{ij} - \bar{X})(Y_{ij} - \bar{Y}) = \sum_{i=1}^{p} (X_i - \bar{X})\sum_{j=1}^{m} (Y_{ij} - \bar{Y})$
remembering that anything which is constant with respect to a particular summation
sign can be moved leftward in front of that summation sign. But

$\sum_{j=1}^{m} (Y_{ij} - \bar{Y}) = \sum_{j=1}^{m} \left[(Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \bar{Y})\right] = m(\bar{Y}_i - \bar{Y})$

since the first term vanishes and the second is a constant with respect to
summation over j. Thus

$\sum xy = m\sum_{i=1}^{p} (X_i - \bar{X})(\bar{Y}_i - \bar{Y})$    (3-4)

Putting Eqs. (3-3) and (3-4) together, the regression slope is

$b = \dfrac{\sum_{i=1}^{p} (X_i - \bar{X})(\bar{Y}_i - \bar{Y})}{\sum_{i=1}^{p} (X_i - \bar{X})^2}$    (3-5)

but this is also the regression slope fitted directly to the $X_i$, $\bar{Y}_i$ points. The reason,
of course, is that the m factor in Eqs. (3-3) and (3-4) cancels out and the class
means are, in effect, each given a weight of unity in determining the regression
slope. In the more general case of unequal numbers of observations, the regression
slope would be given by

$b = \dfrac{\sum_{i=1}^{p}\sum_{j=1}^{m_i} (X_{ij} - \bar{X})(Y_{ij} - \bar{Y})}{\sum_{i=1}^{p}\sum_{j=1}^{m_i} (X_{ij} - \bar{X})^2} = \dfrac{\sum_{i=1}^{p} m_i (X_i - \bar{X})(\bar{Y}_i - \bar{Y})}{\sum_{i=1}^{p} m_i (X_i - \bar{X})^2}$    (3-6)

which can be regarded as a weighted regression applied to the class means, the
weights being equal to the number of observations in each class.
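The relation between the regression on all observations and the regression on class means can be verified on a small data set. The sketch below uses made-up numbers and hypothetical function names; it compares the slope from every observation with the slopes from the class means, first with unit weights and then with weights equal to the class sizes.

```python
def ols_slope(xs, ys, weights=None):
    """Least-squares slope of ys on xs, optionally weighting each point."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    return sxy / sxx

def slopes(classes):
    """classes maps each distinct X value to its list of Y observations."""
    all_x = [x for x, ys in classes.items() for _ in ys]
    all_y = [y for ys in classes.values() for y in ys]
    xs = list(classes)
    means = [sum(ys) / len(ys) for ys in classes.values()]
    counts = [len(ys) for ys in classes.values()]
    return (ols_slope(all_x, all_y),          # every observation
            ols_slope(xs, means),             # class means, unit weights
            ols_slope(xs, means, counts))     # class means, weights m_i

# Made-up data for illustration only.
equal = {1.0: [2.0, 3.0], 2.0: [3.5, 5.5], 3.0: [7.0, 6.0]}
unequal = {1.0: [2.0, 3.0, 2.5], 2.0: [5.0], 3.0: [7.0, 6.0]}
full_eq, unit_eq, weighted_eq = slopes(equal)
full_un, unit_un, weighted_un = slopes(unequal)
print(full_eq, unit_eq, weighted_eq)
print(full_un, unit_un, weighted_un)
```

With equal replication all three slopes coincide; with unequal class sizes only the weighted regression on the class means reproduces the slope from the full data.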

In scatter diagrams such as that in Fig. 3-1, the class means will usually not 
lie exactly on the regression line, but will be spread around it in the same way as, 
but to a lesser degree than, the sample points are spread around the regression 
line. We are thus led to pose two related questions. 




1. Is there any relationship between Y and X? 
2. If so, is it a linear or a nonlinear relationship? 


The problem may be modeled as follows. Set up the hypothesis

$Y_{ij} = \mu_i + \epsilon_{ij}$    for $i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, m$    (3-7)

where the $\epsilon_{ij}$'s are assumed to be independent normal variables with zero mean
and variance $\sigma^2$. Thus

$E(Y|X_i) = \mu_i$    for $i = 1, 2, \ldots, p$    (3-8)

The hypothesis embodied in Eq. (3-7) states that the m observations on Y in the
ith class (corresponding to $X = X_i$) are random drawings from a normal population
with mean $\mu_i$ and variance $\sigma^2$, and that a similar statement holds for the
observations on Y in each of the p classes, the only systematic variation between
classes being in the underlying mean values

$\mu_1, \mu_2, \ldots, \mu_p$

Note carefully the distinction between $\mu_i$ and $\bar{Y}_i$, the former being the true but
unknown mean of the distribution from which the $Y_{ij}$ observations are drawn and
the latter being the actual mean of those sample observations.
The variation hypothesized for the p’s allows a very flexible and general 
relationship between Y and X. In this context the null hypothesis of no relation- 
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ship between Y and X could be set up as 
Ho: fy =h2=--? =p, =p (3-9) 


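A test of $H_0$ can be based on a decomposition of the sum of squares in Y. We
have

$$ (Y_{ij}-\bar Y) = (Y_{ij}-\bar Y_i) + (\bar Y_i-\bar Y) \tag{3-10} $$

Squaring and summing over all sample observations gives†

$$ \sum_{ij}(Y_{ij}-\bar Y)^2 = \sum_{ij}(Y_{ij}-\bar Y_i)^2 + m\sum_{i}(\bar Y_i-\bar Y)^2 \tag{3-11} $$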

The decomposition of the sum of squares in Eq. (3-11) is similar to that given for 
the linear regression model in Chap. 2. The left-hand side again represents the 
sum of the squared deviations of Y, measured from the overall sample mean. The 
first term on the right-hand side is the sum of the squared deviations of the Y’s, 
measured now about the relevant class means, and the second sum of squares is 
that due to the variation of the class means about the overall mean. We can thus 
describe Eq. (3-11) in the form 


$$ \text{TSS} = \text{RSS} + \text{ESS} $$

{Total sum of squares in Y} = {residual or error sum of squares} + {"explained" sum of squares due to the variation of the class means}

In conventional analysis of variance treatments the right-hand-side terms are 
described respectively as the within class and the between class sums of squares. 
The assumption of normality for the Y's now enables us to derive the
following test of $H_0$, based on RSS and ESS. From the results on the $\chi^2$
distribution in App. A-7,

$$ \frac{1}{\sigma^2}\sum_{j}(Y_{ij}-\bar Y_i)^2 \sim \chi^2(m-1) \qquad \text{for each } i = 1, 2, \dots, p $$

Further, since the sum of independent $\chi^2$ variables is also distributed as $\chi^2$,

$$ \frac{1}{\sigma^2}\sum_{ij}(Y_{ij}-\bar Y_i)^2 \sim \chi^2(pm-p) \tag{3-12} $$

Since the mean of a $\chi^2$ distribution is equal to its degrees of freedom, we have

$$ E\left[\frac{\sum_{ij}(Y_{ij}-\bar Y_i)^2}{\sigma^2}\right] = p(m-1) $$



or

$$ E\left[\frac{\sum_{ij}(Y_{ij}-\bar Y_i)^2}{p(m-1)}\right] = \sigma^2 \tag{3-13} $$

† We now write the double summation $\sum_{i=1}^{p}\sum_{j=1}^{m}$ simply as $\sum_{ij}$. Equation (3-11) is derived by
noting that the cross-product term vanishes, that is,

$$ \sum_{ij}(Y_{ij}-\bar Y_i)(\bar Y_i-\bar Y) = \sum_{i}(\bar Y_i-\bar Y)\sum_{j}(Y_{ij}-\bar Y_i) = 0 $$

since $\sum_{j}(Y_{ij}-\bar Y_i) = 0$ for each $i = 1, 2, \dots, p$.


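Under the null hypothesis the $\bar Y_i$ are independent normal variables with mean $\mu$
and variance $\sigma^2/m$. Thus

$$ \frac{m\sum_{i}(\bar Y_i-\bar Y)^2}{\sigma^2} \sim \chi^2(p-1) \tag{3-14} $$

and so

$$ E\left[\frac{m\sum_{i}(\bar Y_i-\bar Y)^2}{p-1}\right] = \sigma^2 \tag{3-15} $$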


We state, without proof, that under the null hypothesis the sums of squares in
Eqs. (3-12) and (3-14) are independently distributed. Thus since the F distribution
is given by the ratio of two independent $\chi^2$ quantities, each divided by its degrees
of freedom, under $H_0$

$$ F = \frac{m\sum_{i}(\bar Y_i-\bar Y)^2/(p-1)}{\sum_{ij}(Y_{ij}-\bar Y_i)^2/[p(m-1)]} = \frac{\text{ESS}/(p-1)}{\text{RSS}/[p(m-1)]} \sim F(p-1,\ p(m-1)) \tag{3-16} $$


The rationale of this test is easily seen from Eqs. (3-13) and (3-15). Under the null
hypothesis the numerator and the denominator of F are independent estimates of
$\sigma^2$, and thus we may expect F to vary randomly about unity. If, however, the null
hypothesis is not true, the between class sum of squares in the numerator of F will 
reflect more than just the random variation of the class means about a common 
mean, and F will rise in value. The null hypothesis is then rejected if the 
computed value of F exceeds a preselected critical F value from the upper tail of 
the distribution. 

The sums of squares in this F statistic may also be used to define the
correlation ratio $\eta$ as follows:

$$ \eta^2 = \frac{\text{ESS}}{\text{TSS}} = \frac{m\sum_{i}(\bar Y_i-\bar Y)^2}{\sum_{ij}(Y_{ij}-\bar Y)^2} = 1 - \frac{\sum_{ij}(Y_{ij}-\bar Y_i)^2}{\sum_{ij}(Y_{ij}-\bar Y)^2} \tag{3-17} $$

This is analogous to the definition of $r^2$ in Chap. 2, with the exception that the
explained sum of squares is based on the variation of the sample means $\bar Y_i$, rather
than the variation of the regression values $\hat Y_i$. The F statistic of Eq. (3-16) may
then be stated equivalently

$$ F = \frac{\eta^2/(p-1)}{(1-\eta^2)/[p(m-1)]} $$

which has the same structure as the expression involving $r^2$ in Eq. (2-52).

Table 3-2 One-way analysis of variance (ANOVA)

Source of variation    Sum of squares                                  Degrees of freedom    Mean square
Between classes        $m\sum_{i}(\bar Y_i-\bar Y)^2 = \text{ESS}$      $p-1$                 $\text{ESS}/(p-1)$
Within classes         $\sum_{ij}(Y_{ij}-\bar Y_i)^2 = \text{RSS}$      $p(m-1)$              $\text{RSS}/[p(m-1)]$
Total                  $\sum_{ij}(Y_{ij}-\bar Y)^2 = \text{TSS}$        $mp-1$


The test defined in Eq. (3-16) is the standard one-way (or one factor) analysis 
of variance. It is based solely on the Y observations and is a test of the 
homogeneity of the p class means. The only role for the X variable has been to 
classify the Y’s into classes associated with a common X value. Thus the X 
variables could have been qualitative variables, such as socioeconomic status or 
educational level. The data for the test are normally set out in an ANOVA table,
such as Table 3-2.

The second and third columns of the table are additive, as usual. Thus, once 
any two of the sums of squares have been calculated, the third follows by using 
TSS = ESS + RSS. 
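
The entries of Table 3-2 can be sketched in a few lines of Python. This is an illustration only, not part of the original exposition: the function name `one_way_anova` and the toy data (p = 3 classes, m = 4 observations each) are hypothetical, and the sketch assumes the same number of observations in every class, as in the text.

```python
def one_way_anova(classes):
    """Return (ESS, RSS, F) for p classes with m observations each."""
    p = len(classes)
    m = len(classes[0])
    if any(len(c) != m for c in classes):
        raise ValueError("this sketch assumes m observations in every class")
    class_means = [sum(c) / m for c in classes]
    # With equal class sizes the overall mean is the mean of the class means.
    grand_mean = sum(class_means) / p
    # Between-class ("explained") sum of squares, as in Eq. (3-11).
    ess = m * sum((ybar_i - grand_mean) ** 2 for ybar_i in class_means)
    # Within-class (residual) sum of squares.
    rss = sum((y - ybar_i) ** 2
              for c, ybar_i in zip(classes, class_means) for y in c)
    # F ratio of Eq. (3-16) with (p - 1) and p(m - 1) degrees of freedom.
    f_stat = (ess / (p - 1)) / (rss / (p * (m - 1)))
    return ess, rss, f_stat


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
toy = [[4.0, 5.0, 6.0, 5.0], [7.0, 8.0, 9.0, 8.0], [1.0, 2.0, 3.0, 2.0]]
ess, rss, f_stat = one_way_anova(toy)
# The overall mean of the toy data is 5, so TSS can be checked directly.
tss = sum((y - 5.0) ** 2 for c in toy for y in c)
print(ess, rss, tss, f_stat)
```

For these toy numbers the class means are 5, 8, and 2, so ESS = 72, RSS = 6, and TSS = 78, which illustrates the additivity TSS = ESS + RSS. The computed F would then be compared with a tabulated critical value from F(p - 1, p(m - 1)).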

We can now carry this analysis one step further and derive a test for the 
linearity of the relationship between Y and X. Figure 3-2 shows just one point
$(X_i, Y_{ij})$ from the X, Y scatter, and $\hat Y = a + bX$ indicates the linear regression
fitted to the data.

Figure 3-2

The deviation of $Y_{ij}$ from the overall mean $\bar Y$ may be decomposed as
follows:†

$$ (Y_{ij}-\bar Y) = (Y_{ij}-\bar Y_i) + (\bar Y_i-\hat Y_i) + (\hat Y_i-\bar Y) \tag{3-18} $$

This decomposition embraces both the linear analysis of Chap. 2 and the analysis 
of class means in this section, as shown by the following groupings. 

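Linear analysis:

$$ (Y_{ij}-\bar Y) = \underbrace{(Y_{ij}-\bar Y_i) + (\bar Y_i-\hat Y_i)}_{(Y_{ij}-\hat Y_i)} + (\hat Y_i-\bar Y) $$

Class mean analysis:

$$ (Y_{ij}-\bar Y) = (Y_{ij}-\bar Y_i) + \underbrace{(\bar Y_i-\hat Y_i) + (\hat Y_i-\bar Y)}_{(\bar Y_i-\bar Y)} $$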


Squaring Eq. (3-18) and summing over all the sample observations gives‡

$$ \sum_{ij}(Y_{ij}-\bar Y)^2 = \sum_{ij}(Y_{ij}-\bar Y_i)^2 + m\sum_{i}(\bar Y_i-\hat Y_i)^2 + m\sum_{i}(\hat Y_i-\bar Y)^2 \tag{3-19} $$

In words, Eq. (3-19) states

{Total sum of squares in Y} = {sum of squares about class means} + {sum of squares due to variation of class means about linear regression} + {sum of squares due to linear regression}


We notice from Eq. (3-19) that $r^2$ is the ratio of the third term on the right-hand
side to $\sum_{ij}(Y_{ij}-\bar Y)^2$, and $\eta^2$ is the ratio of the sum of the second and third terms
to the same total sum of squares in Y. Thus

$$ \eta^2 \ge r^2 $$

The equality would only hold if all class means fell on the linear regression line.
The middle term in Eq. (3-19) is thus proportional to the excess of $\eta^2$ over $r^2$ and
is the basis of the linearity test set out in Table 3-3.

The sums of squares may be calculated in various ways. If $r^2$ and $\eta^2$ have
already been calculated, the required quantities follow directly. If not, $S_4$ can be
calculated first; $S_1$ is then the explained sum of squares due to the regression, to
be calculated by the methods of Chap. 2; $S_3$ can be calculated directly; and $S_2$
then follows from the fact that

$$ S_4 = S_1 + S_2 + S_3 $$

If the class means deviate significantly from the linear regression, $S_2$ will tend to
be large in relation to $S_3$, when appropriate allowance has been made for degrees
of freedom. The test for linearity is thus based on

$$ F = \frac{S_2/(p-2)}{S_3/[p(m-1)]} \tag{3-20} $$

and the linear hypothesis is rejected in favor of a nonlinear alternative if the
computed F exceeds a preselected critical value from $F(p-2,\ p(m-1))$.

Table 3-3 Test of linearity

Source of variation             Sum of squares                                                  Degrees of freedom    Mean square
Regression                      $m\sum_{i}(\hat Y_i-\bar Y)^2 = r^2S_4 = S_1$                    $1$
Class means about regression    $m\sum_{i}(\bar Y_i-\hat Y_i)^2 = (\eta^2-r^2)S_4 = S_2$         $p-2$                 $S_2/(p-2)$
Within classes                  $\sum_{ij}(Y_{ij}-\bar Y_i)^2 = (1-\eta^2)S_4 = S_3$             $p(m-1)$              $S_3/[p(m-1)]$
Total                           $\sum_{ij}(Y_{ij}-\bar Y)^2 = S_4$                               $mp-1$

† In Fig. 3-2 we have shown, for simplicity, a case where $Y_{ij}$ exceeds $\bar Y_i$, which in turn exceeds $\hat Y_i$,
which again exceeds $\bar Y$, so that all the deviations on the right-hand side of Eq. (3-18) are positive. In
general, of course, these deviations vary in sign.

‡ All three cross-product terms vanish. Those involving $(Y_{ij}-\bar Y_i)$ vanish since $\sum_{j}(Y_{ij}-\bar Y_i)$ is zero
for all i. The term involving $\sum_{i}(\bar Y_i-\hat Y_i)(\hat Y_i-\bar Y)$ is simply proportional to the covariance between the
regression values $\hat Y_i$ and the residuals $\bar Y_i-\hat Y_i$ about the regression line, and thus also vanishes.


Example 3-1 We have eight classes (p = 8), each class being defined by a
particular value of X, and five observations (m = 5) in each class, giving 40
observed points on an X, Y scatter. The class means are shown in the final
column of Table 3-4 and display a negative relationship with X, but the class
means do not lie exactly on a straight line.

Table 3-4 Replicated data

Class    $X_i$    $Y_{ij}$ (j = 1, ..., 5)    $\bar Y_i$
i = 1    10       48  51  44  41  52          47.2
i = 2    15       33  32  37  42  41          37.0
i = 3    20       28  28  22  26  23          25.4
i = 4    30       22  17  26  14  19          19.6
i = 5    35       22  17  10  15  10          14.8
i = 6    40       10  10  11  11  10          10.4
i = 7    45       11   5   9  10   9           8.8
i = 8    55        2   0   0   6   4           2.4

We now wish to build up the numerical equivalent of Table 3-3 for these
data. The first step is to fit the linear regression of $\bar Y_i$ on $X_i$. The eight pairs of
values yield the following numbers:

$$ \sum_i X_i = 250 \qquad \sum_i X_i^2 = 9{,}500 \qquad \sum_i \bar Y_i = 165.6 \qquad \sum_i \bar Y_i^2 = 5{,}036.56 \qquad \sum_i X_i\bar Y_i = 3{,}585 $$

from which

$$ \sum_i (X_i-\bar X)^2 = 9{,}500 - \frac{(250)^2}{8} = 1{,}687.50 $$

$$ \sum_i (X_i-\bar X)(\bar Y_i-\bar Y) = 3{,}585 - \frac{(250)(165.6)}{8} = -1{,}590.00 $$

$$ \sum_i (\bar Y_i-\bar Y)^2 = 5{,}036.56 - \frac{(165.6)^2}{8} = 1{,}608.64 $$

so

$$ b = \frac{-1{,}590}{1{,}687.5} = -0.9422 $$

and

$$ a = \frac{165.6}{8} + 0.9422\left(\frac{250}{8}\right) = 50.144 $$

giving the regression equation

$$ \hat Y = 50.144 - 0.9422X $$
Turning now to Table 3-3, we compute first of all the overall sum of squares:

$$ S_4 = \sum_{ij}(Y_{ij}-\bar Y)^2 = \sum_{ij}Y_{ij}^2 - \frac{1}{40}\Bigl(\sum_{ij}Y_{ij}\Bigr)^2 = 25{,}620 - \frac{1}{40}(828)^2 = 8{,}480.4 $$

The sum of squares within classes, $S_3$, is obtained by calculating the sum of
squared deviations within each class about the class mean and aggregating
the result over all classes. For the ith class we have

$$ \sum_{j}(Y_{ij}-\bar Y_i)^2 = \sum_{j}Y_{ij}^2 - \frac{1}{5}\Bigl(\sum_{j}Y_{ij}\Bigr)^2 $$

For the first row in Table 3-4, this gives

$$ (48)^2 + (51)^2 + \cdots + (52)^2 - \tfrac{1}{5}(48 + 51 + \cdots + 52)^2 = 86.8 $$

Carrying out the same calculations for the remaining rows and aggregating
gives

$$ S_3 = 86.8 + 82.0 + 31.2 + 85.2 + 102.8 + 1.2 + 20.8 + 27.2 = 437.2 $$

The square of the correlation ratio may now be computed from Eq. (3-17) as

$$ \eta^2 = 1 - \frac{437.2}{8{,}480.4} = 0.948446 $$

Table 3-5 Sums of squares from the data of Table 3-4

Source of variation             Sum of squares        Degrees of freedom    Mean square
Regression                      $S_1$ = 7,490.67       1
Class means about regression    $S_2$ =   552.53       6                     92.09
Within classes                  $S_3$ =   437.20      32                     13.66
Total                           $S_4$ = 8,480.40      39


The next simplest quantity to compute is

$$ S_1 = m\sum_{i}(\hat Y_i-\bar Y)^2 $$

From the regression of $\bar Y_i$ on $X_i$ the explained sum of squares is

$$ \sum_{i}(\hat Y_i-\bar Y)^2 = \frac{(-1{,}590)^2}{1{,}687.5} = 1{,}498.133 $$

Thus

$$ S_1 = 5(1{,}498.133) = 7{,}490.67 $$

The remaining sum of squares, $S_2$, can now be obtained by subtraction, and
Table 3-5 is prepared.


† There is a subtle point concerning the interpretation of $r^2$ in this example. We have seen that in
the special case, where there is a constant number of observations per class, the regression slope may
be calculated by considering either the 40 observations $(X_i, Y_{ij})$ or the eight observations $(X_i, \bar Y_i)$.
There are, however, two distinct $r^2$. One relates to the regression of $\bar Y_i$ on $X_i$ and the other to the
regression of $Y_{ij}$ on $X_i$. It is intuitively clear that the former $r^2$ must exceed the latter since the class
means will lie closer to the regression line than the raw data. In Table 3-3 $S_1$ is expressed as $r^2S_4$, and
this $r^2$ is the one relating to the raw data. By definition, it is

$$ r^2 = \frac{\bigl[\sum_{ij}(X_i-\bar X)(Y_{ij}-\bar Y)\bigr]^2}{\bigl[\sum_{ij}(X_i-\bar X)^2\bigr]\bigl[\sum_{ij}(Y_{ij}-\bar Y)^2\bigr]} $$

Using Eqs. (3-3) and (3-4),

$$ \sum_{ij}(X_i-\bar X)^2 = 5(1{,}687.50) = 8{,}437.50 $$

$$ \sum_{ij}(X_i-\bar X)(Y_{ij}-\bar Y) = 5(-1{,}590) = -7{,}950 $$

and $\sum_{ij}(Y_{ij}-\bar Y)^2$ has already been calculated as 8,480.40. Thus

$$ r^2 = 0.883292 $$

Then

$$ S_2 = (\eta^2 - r^2)S_4 = (0.948446 - 0.883292)8{,}480.4 = 552.53 $$

which agrees with the value obtained by subtraction in Table 3-5.

The test statistic for the linear regression is

$$ F = \frac{S_1/1}{(S_2+S_3)/38} = \frac{7{,}490.67}{989.73/38} = 287.6 $$

The 5 percent critical value from F(1, 40) is 4.08, and the 1 percent value is
7.31, so this is a highly significant sample statistic, and the data would lead us
to reject decisively the hypothesis of a zero coefficient for X. The test statistic
for the linearity of the relation is

$$ F = \frac{S_2/6}{S_3/32} = \frac{92.09}{13.66} = 6.74 $$

and the 1 percent critical value from F(6, 30) is 3.47. This sample statistic is
also significant, and we conclude that the true relationship is probably
nonlinear. The scatter, regression line, and class means are shown in Fig. 3-3.

Figure 3-3 [scatter of the data, the class means, and the fitted line $\hat Y = 50.144 - 0.9422X$]
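
The arithmetic of Example 3-1 can be retraced with a short Python sketch. It uses the data of Table 3-4 and the formulas of Tables 3-3 and 3-5; the variable names are ours and are not part of the original text. Critical values are not computed here and would still be read from F tables.

```python
# Data of Table 3-4: eight X values, five Y observations per class.
X = [10, 15, 20, 30, 35, 40, 45, 55]
Y = [
    [48, 51, 44, 41, 52],
    [33, 32, 37, 42, 41],
    [28, 28, 22, 26, 23],
    [22, 17, 26, 14, 19],
    [22, 17, 10, 15, 10],
    [10, 10, 11, 11, 10],
    [11, 5, 9, 10, 9],
    [2, 0, 0, 6, 4],
]
p, m = len(X), len(Y[0])

class_means = [sum(row) / m for row in Y]
x_bar = sum(X) / p
y_bar = sum(class_means) / p  # equal class sizes, so this is the overall mean

# Regression of the class means on X (same slope as for the raw data).
sxx = sum((x - x_bar) ** 2 for x in X)
sxy = sum((x - x_bar) * (yb - y_bar) for x, yb in zip(X, class_means))
b = sxy / sxx
a = y_bar - b * x_bar

# Sums of squares of Table 3-3.
s4 = sum((y - y_bar) ** 2 for row in Y for y in row)      # total
s3 = sum((y - yb) ** 2 for row, yb in zip(Y, class_means) for y in row)  # within
s1 = m * b ** 2 * sxx                                     # regression
s2 = s4 - s1 - s3                                         # class means about regression

eta2 = 1 - s3 / s4
r2 = s1 / s4
f_regression = (s1 / 1) / ((s2 + s3) / (m * p - 2))
f_linearity = (s2 / (p - 2)) / (s3 / (p * (m - 1)))

print(round(a, 3), round(b, 4))
print(round(s1, 2), round(s2, 2), round(s3, 2), round(s4, 2))
print(round(eta2, 6), round(r2, 6), round(f_regression, 1), round(f_linearity, 2))
```

The sketch obtains $S_2$ by subtraction, as in the text; computing it directly as $m\sum_i(\bar Y_i-\hat Y_i)^2$ would serve as an independent check.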


3-2 NONLINEAR RELATIONS


The relation

$$ Y = \alpha + \beta X + u \tag{3-21} $$

is linear in the parameters $\alpha$ and $\beta$ and also in the variables X and Y, and, as we
have seen, application of the least-squares principle results in two simultaneous
equations, which are linear in the estimates a and b and thus easy to solve. The
relation

$$ \log Y = \alpha + \beta X + u \tag{3-22} $$

may be written

$$ Z = \alpha + \beta X + u \tag{3-23} $$

where

$$ Z = \log Y \tag{3-24} $$


Another approach is to introduce additional terms in X on the right-hand 
side of the relation. If Y denoted this average variable cost per unit of output and 


X the rate of output, the U-shaped cost curve of economic theory might be
depicted as

$$ Y = \alpha + \beta X + \gamma X^2 + u \tag{3-25} $$

The nonlinearity between Y and X here requires the introduction of an additional
explanatory variable, $X^2$, and we are now involved in a three-variable regression
problem, which will be taken up in Sec. 3-4. Relation (3-25), however, is still
linear in the parameters $\alpha$, $\beta$, and $\gamma$, and the least-squares principle can easily be
extended to deal with three (and more) variables.

A more difficult relation is one such as

$$ Y = \alpha + \beta X^{\gamma} + u \tag{3-26} $$

If we attempted to apply least squares directly, we would have to choose values of
a, b, and c to minimize

$$ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\bigl(Y_i - a - bX_i^{c}\bigr)^2 $$
The resultant normal equations are nonlinear in the unknowns and cannot be
solved analytically. If, however, we consider a somewhat different relation,

$$ Y = \beta X^{\gamma}u \tag{3-27} $$

taking logarithms of both sides gives

$$ \log Y = \log\beta + \gamma\log X + \log u \tag{3-28} $$

There are two crucial differences between Eqs. (3-26) and (3-27): in the latter the
intercept $\alpha$ is assumed to be zero and the disturbance term has also been entered
multiplicatively rather than additively. The consequence of these assumptions is
the linear relation in Eq. (3-28). If b and c denote the intercept and the slope,
respectively, of the least-squares regression of log Y on log X, we obtain estimates
of $\beta$ and $\gamma$ as follows:

Estimate of $\gamma$ = c

Estimate of $\beta$ = antilog b


The quality of these estimates and the justification for the application of least
squares to Eq. (3-28) depend, however, on log u having the properties postulated
for the disturbance term in Chap. 2. The contrast between the treatment of the
disturbance term in Eqs. (3-26) and (3-27) illustrates one of the fundamental
problems of the econometrician. Economic theory cannot yield precise propositions
on the nature of the disturbance term, and so the econometrician proceeds
somewhat pragmatically by first of all assuming the most convenient form for the
disturbance term in any specific application and then attempting to test these
assumptions as far as possible by the methods described in subsequent chapters.†
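
The logarithmic route from Eq. (3-27) to Eq. (3-28) can be sketched in Python as follows. The data are hypothetical and are generated without any disturbance (u = 1 throughout), so the sketch only shows the mechanics of the transformation; the helper `ols` and the parameter values 2.5 and 0.7 are assumptions made for the illustration.

```python
import math


def ols(x, y):
    """Least-squares intercept and slope for the regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data from Y = beta * X**gamma with beta = 2.5, gamma = 0.7.
X = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
Y = [2.5 * x ** 0.7 for x in X]

# Regress log Y on log X (base 10), as in Eq. (3-28).
b, c = ols([math.log10(x) for x in X], [math.log10(y) for y in Y])

gamma_hat = c        # estimate of gamma is the slope
beta_hat = 10 ** b   # estimate of beta is the antilog of the intercept
print(beta_hat, gamma_hat)
```

With real data the recovered values would of course differ from the generating ones, and their quality would depend on the behavior of log u, as the text emphasizes.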

In principle the two main approaches to the fitting of nonlinear relations are 
either to seek transformations of some or all of the variables to reduce the 
problem to a linear form or else to fit the nonlinear form directly. The first 
approach is analyzed in more detail in the next section. 


3-3 TRANSFORMATIONS OF VARIABLES 


† The late Sir Julian Huxley once defined God as a "personified symbol for man's residual
ignorance." In a similar vein the disturbance term might be regarded as the econometrician's stochastic
symbol for his residual ignorance, and then, as is sometimes done with God, the inscrutable and
unknowable may be ascribed the properties most convenient for the current problem.

In very rare cases, economic theory may indicate the appropriate transformation
of variables. As Zarembka has pointed out, the constant elasticity of substitution
(CES) production function†

$$ Y = \bigl(a_1K^{\rho} + a_2L^{\rho}\bigr)^{\nu/\rho} $$

gives

$$ Y^{\rho/\nu} = a_1K^{\rho} + a_2L^{\rho} \tag{3-29} $$

Thus each observation on output should be raised to the power $\rho/\nu$, and each
observation on capital and labor inputs should be raised to the power $\rho$. This is
an example of power transformations of the variables, and although Eq. (3-29) is
linear in the transformed variables, it still poses difficult estimation problems.
Some special cases of power transformations, however, yield simple estimating
procedures, and we now turn to these.

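Returning to two-variable relations, let us denote a transformation of the Y
variable by $Y^{(\lambda_1)}$. This symbolism indicates that the transformation depends only
on a single parameter $\lambda_1$. Likewise, let us indicate the transformation of the X
variable by $X^{(\lambda_2)}$. A very general form of transformation has been proposed by
Box and Cox, namely,‡

$$ Y^{(\lambda_1)} = \begin{cases} \dfrac{Y^{\lambda_1}-1}{\lambda_1} & \lambda_1 \neq 0 \\[1ex] \ln Y & \lambda_1 = 0 \end{cases} \tag{3-30} $$

and similarly,

$$ X^{(\lambda_2)} = \begin{cases} \dfrac{X^{\lambda_2}-1}{\lambda_2} & \lambda_2 \neq 0 \\[1ex] \ln X & \lambda_2 = 0 \end{cases} \tag{3-31} $$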

† Chap. 3, "Transformation of Variables in Econometrics," in P. Zarembka (Ed.), Frontiers in
Econometrics, Academic Press, New York and London, 1974. The constant elasticity of substitution
(CES) production function was introduced by K. J. Arrow, H. B. Chenery, B. S. Minhas, and R. M.
Solow, "Capital-Labor Substitution and Economic Efficiency," Review of Economics and Statistics, 43,
1961, pp. 225-250. The elasticity of substitution is given by $1/(1-\rho)$ with $\rho < 1$, and $\nu$ denotes
returns to scale.

‡ G. E. P. Box and D. R. Cox, "An Analysis of Transformations," Journal of the Royal Statistical
Society, ser. B, 1964, pp. 211-252. From now on the symbols log and ln denote logarithms to base 10
and to base e (natural logarithms), respectively.

At first sight these seem needlessly complicated transformations, and one might
well ask, why not use the simple power transformations $Y^{\lambda_1}$, $X^{\lambda_2}$? An examination
of Fig. 3-4 indicates the rationale behind the Box-Cox transformation. Fig. 3-4a
shows the simple power transformation $Y^{\lambda}$ for two illustrative values of Y,
namely, 10 and e = 2.71828. The transformed variables cross at the (0, 1) point,
and to the left and right of that point their ordering is reversed. The simple power
transformation is thus unsatisfactory since different values of $\lambda$ would not
preserve the ordering of the data. Fig. 3-4b shows the graphs of $Y^{\lambda}/\lambda$ for the
same two values of Y, and now the ordering is the same for all values of $\lambda$, but a
discontinuity occurs at $\lambda = 0$. Finally, Fig. 3-4c shows the graph of the general
transformation

$$ \frac{Y^{\lambda}-1}{\lambda} $$

and now the ordering is the same for all values of $\lambda$, and there is no discontinuity
at $\lambda = 0$. If we substitute $\lambda_1 = 0$ in Eq. (3-30), we obtain $Y^{(0)} = 0/0$, which is
indeterminate. However, the application of L'Hôpital's rule shows that†

$$ \lim_{\lambda_1 \to 0} Y^{(\lambda_1)} = \ln Y $$

Suppose now that the transformed variables fit the linear model, that is,

$$ Y^{(\lambda_1)} = \alpha_0 + \beta X^{(\lambda_2)} + u \tag{3-32} $$

This model has five basic parameters, namely, $\alpha_0$, $\beta$, $\lambda_1$, $\lambda_2$, and $\sigma^2$. In this
section we will consider only some special cases corresponding to particular
values of $\lambda_1$ and $\lambda_2$.


Case 3-1: $\lambda_1 = 1 = \lambda_2$ (Linear model). Combining Eqs. (3-30), (3-31), and (3-32)
now gives

$$ Y = \alpha + \beta X + u $$

where

$$ \alpha = 1 + \alpha_0 - \beta $$

This is the simple linear model of Chap. 2.


† For a definition of L'Hôpital's rule see, for example, A. C. Chiang, Fundamental Methods for
Mathematical Economics, 2d ed., McGraw-Hill, New York, 1974, p. 420. The application of the rule to
$Y^{(\lambda_1)}$ states that

$$ \lim_{\lambda_1 \to 0} Y^{(\lambda_1)} = \lim_{\lambda_1 \to 0} \frac{(d/d\lambda_1)(Y^{\lambda_1}-1)}{(d/d\lambda_1)(\lambda_1)} = \lim_{\lambda_1 \to 0} \frac{Y^{\lambda_1}\ln Y}{1} = \ln Y $$

This development uses the result that $(d/d\lambda_1)(Y^{\lambda_1}) = Y^{\lambda_1}\ln Y$. To derive this from first principles,
consider a general function

$$ y = a^x $$

where a is some constant, and define

$$ z = \ln y = x\ln a $$

Then

$$ \frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx} = \frac{1}{y}\,\frac{dy}{dx} $$

But

$$ \frac{dz}{dx} = \ln a $$

Thus

$$ \frac{dy}{dx} = y\ln a = a^x\ln a $$

Figure 3-5 The log-log model.


Case 3-2: $\lambda_1 = 0 = \lambda_2$ (Log-log model). Combining Eqs. (3-30), (3-31), and (3-32)
now gives

$$ \ln Y = \alpha_0 + \beta\ln X + u \tag{3-33} $$

Again, all the techniques of Chap. 2 may be applied, once the original data have
been transformed to logarithmic form. Notice that although Eq. (3-33) is specified
in terms of logarithms to base e, one may take logarithms to base 10 in carrying
out the empirical work. The estimate of $\alpha_0$ will be affected by the choice of base,
but that of $\beta$ will not.†

Ignoring the disturbance term in Eq. (3-33), the relationship between Y and X
is

$$ Y = A_0X^{\beta} \tag{3-34} $$

where

$$ A_0 = e^{\alpha_0} $$

From Eq. (3-34)

$$ \frac{dY}{dX} = \beta A_0X^{\beta-1} $$

so that, if $\beta$ is positive, the slope is always positive and Y tends to infinity as X
tends to infinity. If $\beta$ exceeds unity, the slope increases continually as X increases,
while if $0 < \beta < 1$, the slope decreases continually, though always remaining
positive. When $\beta$ is negative, the slope is always negative. This gives the shapes
pictured in Fig. 3-5. The relationship only exists for positive values of the
variables.

The double logarithmic relationship has a very important characteristic. It is
a constant elasticity function, and that elasticity is given by $\beta$.‡ Thus when Eq.
(3-33) is fitted to the data the regression slope is a point estimate of the elasticity.
If $\beta = -1$, Eq. (3-34) gives

$$ XY = A_0 \tag{3-35} $$

which is a rectangular hyperbola. If Y denoted the quantity purchased and X the
price per unit of some commodity, then Eq. (3-35) would represent a demand
curve with constant elasticity of -1 and a constant total expenditure on the
commodity, whatever its price.

† If $\ln Y = N$, then $Y = e^{N}$, and

$$ \log_{10}Y = N\log_{10}e = \ln Y\,\log_{10}e $$

Thus Eq. (3-33) may be written

$$ \log_{10}Y = (\alpha_0\log_{10}e) + \beta\log_{10}X + (u\log_{10}e) $$

so that the intercept term being estimated is $\alpha_0\log_{10}e$.

‡ See App. A-2, Exponential and Logarithmic Functions.

Figure 3-6 [two panels showing $Y = e^{\alpha+\beta X}$, one for $\beta > 0$ and one for $\beta < 0$]


Case 3-3: $\lambda_1 = 0$, $\lambda_2 = 1$ (Semilog model). This combination of $\lambda$ values gives†

$$ \ln Y = \alpha + \beta X + u \tag{3-36} $$

where

$$ \alpha = \alpha_0 - \beta $$

From Eq. (3-36),

$$ \frac{1}{Y}\,\frac{dY}{dX} = \beta \tag{3-37} $$

Thus the proportionate rate of change in Y per unit change in X is a constant and
equal to $\beta$. The function is only defined for positive values of Y. Ignoring the
disturbance in Eq. (3-36), we may rewrite the function as

$$ Y = e^{\alpha+\beta X} \tag{3-38} $$

and its general shape is shown in Fig. 3-6. The intercept is given by $e^{\alpha}$, and the
slope is positive or negative, depending on the sign of $\beta$.

† Equation (3-36) is a widely used specification in human capital models, where Y denotes earnings
and X years of schooling. The specific functional form is derived from theoretical considerations by J.
Mincer, Schooling, Experience and Earnings, Columbia University Press, New York, 1974.

Table 3-6 Bituminous coal output in the United States, 1841-1910

              Average annual output
              (1,000 net tons)
Decade        Y            Log Y     X = t
1841-1850       1,837      3.2641    -3
1851-1860       4,868      3.6873    -2
1861-1870      12,411      4.0937    -1
1871-1880      32,617      4.5135     0
1881-1890      82,770      4.9179     1
1891-1900     148,457      5.1718     2
1901-1910     322,958      5.5092     3


A special case of Eq. (3-38) occurs when X denotes time and the function
then describes a variable Y which displays a constant proportionate rate of
growth ($\beta > 0$) or decay ($\beta < 0$).


Example 3-2 The first step in any empirical application of Eq. (3-36) is to 
check visually on whether the constant growth assumption seems warranted. 
This may be done either by plotting log Y against time or, equivalently, by 
plotting Y against time on commercial semilog paper. The first procedure 
applied to the data of Table 3-6 gives the scatter diagram shown in Fig. 3-7,
and it is clear that the relationship is approximately linear.

Figure 3-7 [scatter of log Y against t with the fitted line $\log\hat Y = 4.4510 + 0.3760t$]

In fitting the constant growth curve

$$ Y_t = e^{\alpha+\beta t} \tag{3-39} $$

to the data of Table 3-6, we have to be careful about the treatment of the
time variable, about the base to which logarithms are taken, and about the
way in which growth rates are expressed and measured. Equation (3-39) is a
continuous time formulation. It may be expressed as

$$ Y_t = Y_0e^{\beta t} \tag{3-40} $$

where $Y_0$ = value of Y at t = 0
and

$$ \beta = \frac{1}{Y_t}\,\frac{dY_t}{dt} = \text{instantaneous rate of growth of } Y \text{ at time } t $$

When time is measured in discrete intervals, such as quarters or years, a
constant growth series would be expressed as

$$ Y_t = Y_0(1+g)^t \tag{3-41} $$

where g = proportionate rate of growth in Y per unit of time.
Taking logarithms of Eq. (3-41) to base 10 gives

$$ \log Y_t = \log Y_0 + [\log(1+g)]t \tag{3-42} $$

This is the equation estimated with actual data. Thus

Intercept = estimate of $\log Y_0$
Slope = estimate of $\log(1+g)$

and so an estimate of g can be obtained. Taking logarithms of Eq. (3-40), also
to base 10, gives

$$ \log Y_t = \log Y_0 + (\beta\log e)t $$

Comparison with Eq. (3-42) shows that

$$ \beta\log e = \log(1+g) $$

or

$$ \beta = \ln(1+g) \tag{3-43} $$

which can be used to provide an estimate of $\beta$ corresponding to any
estimated g. The interpretation is that $\beta$ is the rate which with continuous
compounding would give the same result as a single increment at rate g.†

† ... of adding interest of 10 percent once a year, 5 percent should be added twice a year. After 2 years of
graduate school your \$100 would now grow to $\$100(1.05)^4 = \$121.55$, so that your perspicacity has
paid off to the tune of 55 cents. You now become greedy and suggest that 1 percent be added 10 times
a year, or perhaps ½ percent 20 times a year. You have, in fact, discovered the principle that

$$ (1+r) < \left(1+\frac{r}{2}\right)^2 < \left(1+\frac{r}{3}\right)^3 < \cdots < \left(1+\frac{r}{n}\right)^n $$

Unfortunately it is no magic device for unlimited increases in your wealth. The sequence has a limit,
namely,

$$ \lim_{n \to \infty}\left(1+\frac{r}{n}\right)^n = e^{r} $$

where

$$ e = \lim_{n \to \infty}\left(1+\frac{1}{n}\right)^n = 2.71828 $$

For a 10 percent growth rate, r = 0.10 and $e^{r} = 1.1052$. Thus continuous compounding within a period
at a rate of 10 percent is equivalent to a growth rate of just over 10½ percent over the period.

We can select the origin and the units of measurement for time at our convenience.
In the final column of Table 3-6 we have chosen the midpoint of the
1871-1880 decade as the origin and measured time in units of 10 years. The
normal equations for fitting a linear regression of log Y on t are

$$ \sum\log Y = na + b\sum t $$

$$ \sum t\log Y = a\sum t + b\sum t^2 $$

Since the t's have been chosen so that $\sum t = 0$, these equations simplify to

$$ a = \frac{\sum\log Y}{n} \qquad b = \frac{\sum t\log Y}{\sum t^2} $$

For the data in the table, these equations give

$$ a = 31.1575/7 = 4.4511 $$

$$ b = 10.5285/28 = 0.3760 $$

Thus the regression estimate of Eq. (3-42) is

$$ \log\hat Y = 4.4511 + 0.3760t $$

The $r^2$ is 0.9945, which confirms the story of the scatter diagram that the
constant growth curve fits this series very well. To find the estimated growth
rate,

$$ \log(1+\hat g) = 0.3760 $$

$$ 1+\hat g = 2.3768 $$

Thus

$$ \hat g = 1.3768 $$

Since t was measured in units of 10 years, this gives the estimated rate of
growth per decade as 137.7 percent. The annual growth rate is found from

$$ (1+r)^{10} = 2.3768 $$

giving

$$ r = 0.0904 $$

or just over 9 percent per annum. From Eq. (3-43) the corresponding
continuous rate $\beta$ is 0.0866.
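
The computations of Example 3-2 can be retraced with the following Python sketch. It takes the Log Y and t columns of Table 3-6 as given (rather than recomputing the logarithms from the output figures), and the variable names are ours, introduced only for the illustration.

```python
import math

# Columns of Table 3-6: time index (decades, origin at 1871-1880) and log10 of output.
t = [-3, -2, -1, 0, 1, 2, 3]
log_y = [3.2641, 3.6873, 4.0937, 4.5135, 4.9179, 5.1718, 5.5092]
n = len(t)

# Because sum(t) = 0, the normal equations separate.
sum_t2 = sum(ti ** 2 for ti in t)
a = sum(log_y) / n
b = sum(ti * li for ti, li in zip(t, log_y)) / sum_t2

# r^2 as explained over total sum of squares in log Y.
tss = sum((li - a) ** 2 for li in log_y)
ess = b ** 2 * sum_t2
r2 = ess / tss

g_decade = 10 ** b - 1                          # growth per decade, from Eq. (3-42)
r_annual = (1 + g_decade) ** (1 / 10) - 1       # equivalent annual rate
beta_annual = math.log(1 + r_annual)            # continuous rate, Eq. (3-43)

print(round(a, 4), round(b, 4), round(r2, 4))
print(round(g_decade, 4), round(r_annual, 4), round(beta_annual, 4))
```

Small differences in the last decimal place relative to the text arise because the text rounds the slope to 0.3760 before taking antilogarithms.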


Case 3-4: $\lambda_1 = 1$, $\lambda_2 = -1$ (Reciprocal model). These values give the relation

$$ Y = \alpha + \beta\left(\frac{1}{X}\right) + u \tag{3-44} $$

The slope is $dY/dX = -\beta/X^2$. Thus if $\beta$ is positive, the slope is everywhere
negative, and conversely it is positive when $\beta$ is negative. Since $1/X \to 0$ as
$X \to \infty$, $\alpha$ denotes an asymptotic value for Y. The shape of this function is
indicated in Fig. 3-8.

Figure 3-8 The reciprocal model.

before anything is spent on this commodity. Thus Eq. (3-44) cannot serve as a
general model for all types of consumption since we need to allow for finite
consumption of some commodities at values of X close to zero. This particular
difficulty can be overcome by the choice of yet another pair of $\lambda$ values.


Case 3-5: $\lambda_1 = 0$, $\lambda_2 = -1$ (Logarithmic reciprocal model). This choice of parameters
gives

$$ \ln Y = \alpha - \beta\left(\frac{1}{X}\right) + u \tag{3-45} $$

Ignoring the disturbance term, we may write Eq. (3-45) as

$$ Y = e^{\alpha-\beta/X} \tag{3-46} $$

Y is not defined for X = 0, but $Y \to 0$ as $X \to 0$, so we can define Y(0) as zero,
and we then have a function which is right-hand continuous at the origin. From
Eq. (3-46)

$$ \frac{dY}{dX} = \left(\frac{\beta}{X^2}\right)e^{\alpha-\beta/X} \tag{3-47} $$

Thus the slope is positive for positive $\beta$. The second derivative is

$$ \frac{d^2Y}{dX^2} = \left(\frac{\beta^2}{X^4} - \frac{2\beta}{X^3}\right)e^{\alpha-\beta/X} $$

Hence there is a point of inflection where $X = \beta/2$. To the left of this point the
slope increases with X; to the right of it the slope diminishes. As $X \to \infty$,
$Y \to e^{\alpha}$. Substituting $X = \beta/2$ in Eq. (3-46) gives the value of Y at the point of
inflection as $0.135e^{\alpha}$ or, in other words, about 13 percent of the asymptotic value
of Y. The general shape of this function is shown in Fig. 3-9.

Figure 3-9 The logarithmic-reciprocal curve.

If X represents time, Eq. (3-46) then pictures a growth curve which starts at
zero and approaches an asymptotic level. Rewriting Eq. (3-47) in an equivalent
form gives

$$ \frac{1}{Y}\,\frac{dY}{dX} = \frac{\beta}{X^2} $$

that is, the rate of growth in Y per unit change in X is inversely proportional to
the square of X. Thus the rate of growth falls off sharply after low values of X, as
shown in Fig. 3-9. This can be a disadvantage of the curve, which may more than
offset the ease of fitting.
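
The location of the point of inflection and the height of the curve there can be verified numerically. The sketch below evaluates Eq. (3-46) and its second derivative for hypothetical parameter values ($\alpha = 1$, $\beta = 4$); the function names are ours and serve only the illustration.

```python
import math

alpha, beta = 1.0, 4.0  # hypothetical values, with beta > 0


def y(x):
    """Logarithmic reciprocal curve of Eq. (3-46)."""
    return math.exp(alpha - beta / x)


def d2y(x):
    """Second derivative of Eq. (3-46) with respect to X."""
    return (beta ** 2 / x ** 4 - 2.0 * beta / x ** 3) * y(x)


x_star = beta / 2.0                        # claimed point of inflection
ratio = y(x_star) / math.exp(alpha)        # height relative to the asymptote e**alpha

# The slope is increasing just left of x_star and decreasing just right of it.
print(d2y(0.9 * x_star) > 0, d2y(x_star), d2y(1.1 * x_star) < 0)
print(round(ratio, 3))
```

The ratio equals $e^{-2}$ whatever the values chosen for $\alpha$ and $\beta$, which is the source of the figure 0.135 quoted in the text.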

A similar curve, which also has an upper asymptote at some finite level and a 
lower asymptote at zero, but has a more symmetrical shape between the two, is 
the /ogistic. This curve cannot be derived from the general linear relation (3-32) by 
a suitable choice of values for A, and \,. Nonetheless it has been widely used in 
fitting growth trends, and so a brief account of it is given here. The logistic 
equation is 

k 


=-——_— 3-48 
1+ be 4 


where a, b, and k are parameters to be determined. We have written Y as a 

function of time ¢, as this is by far the most common practice, but in some 

applications it is quite feasible to replace ¢ by some independent variable X. 
From Eg. (3-48) it is clear that 


Yi as t> 0 
and Y-0 as t>-© 


so that k is the upper asymptote and zero the lower asymptote. The first 
derivative of Eq. (3-48) is 

dY_a 

army (had) (3-49) 
Thus the rate of change of Y with respect to 1 is proportional to the current level 
Y and also to the distance still to travel to reach the saturation level k. The first 
derivative is positive for all values of t. The second derivative may be written 


(3-50) 


Setting this to zero gives a point of inflection at 


k 
2 


Thus when Y < k/2, the “large” value of k — Y dominates the “small” value of 
Y in Eq. (3-49) and causes dY/dt to increase. As Y increases toward k/2, the 
relative balance of the two forces changes so that dY/dt reaches a maximum 
value when Y = k/2 and thereafter declines steadily as Y rises toward the 
saturation level k. The typical shape of the logistic curve is shown in Fig. 3-10, 
the main contrast with Fig. 3-9 being that the point of inflection occurs at half the 
saturation value rather than at a much lower level. The logistic curve is frequently 
used as a plausible approximation to the growth of any “population,” whether 
bacterial, animal, human, or economic, where growth is thought to be positively 
related to the size of the existing population and negatively related to the current 
distance from a saturation level. 
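
The shape properties just described can be checked numerically. The following is a minimal sketch, not part of the original exposition: the parameter values are hypothetical and chosen only for illustration. It compares the slope formula (3-49) with a finite-difference derivative and locates the point where the slope is greatest.

```python
import numpy as np

# Hypothetical parameter values chosen only for illustration.
k, a, b = 100.0, 0.5, 20.0

def logistic(t):
    """Logistic curve Y = k / (1 + b * exp(-a t)), Eq. (3-48)."""
    return k / (1.0 + b * np.exp(-a * t))

t = np.linspace(-20.0, 40.0, 6001)
Y = logistic(t)

# Slope from Eq. (3-49) versus a numerical derivative of the curve.
slope_formula = (a / k) * Y * (k - Y)
slope_numeric = np.gradient(Y, t)
max_gap = np.max(np.abs(slope_formula - slope_numeric))

# The slope peaks where Y = k/2, that is, at t = ln(b)/a.
i_peak = np.argmax(slope_formula)
Y_at_peak, t_at_peak = Y[i_peak], t[i_peak]
```

On the grid used here the steepest point should fall next to Y = k/2, in line with the inflection result above.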


Figure 3-10 The logistic curve.


Estimation of all the parameters a, b, and k cannot be achieved by the simple
methods that we have described so far. From Eq. (3-49) we can write

$$\frac{1}{Y}\frac{dY}{dt} = a - \frac{a}{k}\,Y$$

If time is measured in constant units, the left-hand side of this equation is
approximately $\Delta Y/Y$, the proportionate rate of growth of Y. Thus one might fit
the linear regression

$$\frac{\Delta Y_t}{Y_t} = a - \frac{a}{k}\,Y_t \tag{3-51}$$

which yields point estimates of a and k. To obtain an estimate of b, we note that
Eq. (3-48) can be rearranged to give

$$b = \frac{k - Y_t}{Y_t}\,e^{at} \tag{3-52}$$

Thus a value of b can be computed for each $Y_t$, given $\hat{a}$ and $\hat{k}$. An estimate $\hat{b}$ may
then be obtained by averaging some or all of these computed b values, or
alternatively by substituting $\bar{Y}$ and $\bar{t}$ in Eq. (3-52).
There are two main difficulties with this simple procedure for estimating the
logistic function.† First of all we have point estimates of the parameters, but
inference procedures are difficult, especially for b and k. Second, there is evidence
that the procedure is unsatisfactory compared with a direct estimation by
nonlinear methods.
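
The two-step procedure can be sketched on artificial data. This is an illustrative sketch only: the data are noise-free and generated from hypothetical parameter values, $\Delta Y_t$ is taken as the forward difference $Y_{t+1} - Y_t$, and the choice to average b over the early part of the sample is an assumption made here, not a recommendation from the text.

```python
import numpy as np

# Synthetic, noise-free data from a logistic with hypothetical parameters.
a_true, b_true, k_true = 0.1, 50.0, 100.0
t = np.arange(100.0)
Y = k_true / (1.0 + b_true * np.exp(-a_true * t))

# Step 1: regress the proportionate growth rate on the level, Eq. (3-51).
growth = np.diff(Y) / Y[:-1]              # (Y[t+1] - Y[t]) / Y[t]
slope, intercept = np.polyfit(Y[:-1], growth, 1)
a_hat = intercept                          # intercept estimates a
k_hat = -intercept / slope                 # slope estimates -a/k

# Step 2: back out b from Eq. (3-52) for each observation and average.
# Only the early part of the sample is used (an arbitrary choice), because
# later values amplify any error in a_hat through exp(a_hat * t).
early = Y < k_hat / 2
b_values = (k_hat - Y[early]) / Y[early] * np.exp(a_hat * t[early])
b_hat = b_values.mean()
```

Even without noise the estimates are only approximate, since the discrete growth rate is not exactly linear in the level; this is one way of seeing why direct nonlinear estimation may be preferred.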

† See F. R. Oliver, “Methods of Estimating the Logistic Growth Function,” Applied Statistics,
1964, pp. 57-66. This paper describes an iterative program for computing the (nonlinear) least-squares
estimates. A further paper, F. R. Oliver, “Notes on the Logistic Curve for Human Populations,”
Journal of the Royal Statistical Society, Vol. 145, 1982, pp. 359-363, gives formulas for the asymptotic
standard errors of the least-squares estimates.
Returning to the model
$$Y^{(\lambda_1)} = \alpha + \beta X^{(\lambda_2)} + u$$


we have outlined five special cases, depending on particular choices of 1, 0, or $-1$
for the $\lambda$ parameters. In each case a simple linear relation holds between the
transformed variables, and the techniques of Chap. 2 may be applied, provided
the assumptions about the disturbance term are fulfilled. However, the question
immediately arises: Why restrict consideration to just three possible values for $\lambda$?
If the $\lambda$'s are free to take on any values, the approximation to linearity may well
be improved, but we then have five parameters to be estimated, namely, $\alpha$, $\beta$, $\lambda_1$,
$\lambda_2$, and $\sigma_u$. If the $\lambda$'s can be estimated and if we can test hypotheses about them,
we may be able to discriminate between functional forms. This more general
approach involves nonlinear estimation, which is beyond the scope of this book.


3-4 THREE-VARIABLE REGRESSION 


As indicated in Sec. 3-2, some nonlinear relations may be represented by the use 
of several explanatory variables on the right-hand side of the relation. For 
example, the conventional total cost function of economic theory might be 
represented as 


$$TC = \alpha + \beta Q + \gamma Q^2 + \delta Q^3 \tag{3-53}$$
where TC = total cost per period
      Q = output per period

and $\alpha$, $\beta$, $\gamma$, and $\delta$ are parameters. From Eq. (3-53), the average cost (AC) and the
marginal cost (MC) are obtained, respectively, as


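$$AC = \frac{TC}{Q} = \underbrace{\frac{\alpha}{Q}}_{\text{average fixed cost}} + \underbrace{(\beta + \gamma Q + \delta Q^2)}_{\text{average variable cost}} \tag{3-54}$$
and
$$MC = \frac{d(TC)}{dQ} = \beta + 2\gamma Q + 3\delta Q^2 \tag{3-55}$$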


With certain restrictions on the parameters, these functions will have the conventional
shapes shown in Fig. 3-11.
Relation (3-54) shows AC as a linear function of the variables

$$\frac{1}{Q}, \quad Q, \quad \text{and} \quad Q^2$$
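
Because AC is linear in these transformed variables, its parameters can be obtained by ordinary least squares. The sketch below is illustrative only: the cost parameters and the output grid are hypothetical, and the data are generated without any disturbance so that the fit reproduces the parameters.

```python
import numpy as np

# Hypothetical cost parameters (alpha, beta, gamma, delta) for illustration.
alpha, beta, gamma, delta = 100.0, 20.0, -3.0, 0.2
Q = np.linspace(1.0, 20.0, 40)

TC = alpha + beta * Q + gamma * Q**2 + delta * Q**3    # Eq. (3-53)
AC = TC / Q                                            # Eq. (3-54)

# AC is linear in the regressors 1/Q, 1, Q, Q^2, so least squares on these
# transformed variables recovers the four parameters in this noise-free case.
X = np.column_stack([1.0 / Q, np.ones_like(Q), Q, Q**2])
coef, *_ = np.linalg.lstsq(X, AC, rcond=None)
```

Here `coef` holds the estimates in the order $\alpha$, $\beta$, $\gamma$, $\delta$.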
Figure 3-11


More generally the additional explanatory variables need not be restricted to
transformations of a single variable but may be different variables. For example,
an expectations-augmented Phillips curve may be written as

$$w_t = \alpha + \beta\left(\frac{1}{u_t}\right) + \gamma\,p_t^e$$

where $w_t$ = rate of wage change
      $u_t$ = unemployment rate
      $p_t^e$ = expected inflation rate


The statistical theory of relationships such as these may be more conveniently
handled by using the following general notation:
$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + u \tag{3-56}$$
This states that the dependent variable Y is determined by a linear combination
of $k - 1$ explanatory variables $X_2, X_3, \ldots, X_k$ and a disturbance term. The
subscripts on the X's indicate different explanatory variables. As the illustrations
above show, some of these X's may well be transformations of other X's. The
relation (3-56) is assumed to hold at each sample point. Thus we may write

$$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \cdots + \beta_k X_{kt} + u_t \qquad t = 1, 2, \ldots, n \tag{3-57}$$

or, more compactly,

$$Y_t = \sum_{i=1}^{k} \beta_i X_{it} + u_t \qquad t = 1, 2, \ldots, n \tag{3-58}$$
where $X_{1t} = 1$ for all t

and the interpretation of the double subscript on X is that $X_{it}$ denotes the tth
observation on the variable $X_i$. $X_1$ is an example of a dummy variable, that is, a
variable which takes on a priori specified values (in this case unity) at the sample
points.

Equation (3-57) defines the linear model in k variables, and it will be treated 
fully in Chap. 5. As an introduction we deal explicitly in this section with a 
three-variable model, but the concepts developed will facilitate our understanding 
of the k-variable model. 

The three-variable model is specified as 


$$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + u_t \qquad t = 1, 2, \ldots, n \tag{3-59}$$
Replacing the unknown $\beta$'s in Eq. (3-59) by any arbitrary set of numerical
coefficients $b_1$, $b_2$, and $b_3$ gives
$$Y_t = b_1 + b_2 X_{2t} + b_3 X_{3t} + e_t \tag{3-60}$$
The residual sum of squares in Eq. (3-60) is then

$$\text{RSS} = \sum_{t=1}^{n} e_t^2 = \sum_{t=1}^{n} (Y_t - b_1 - b_2 X_{2t} - b_3 X_{3t})^2 = f(b_1, b_2, b_3) \tag{3-61}$$


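The least-squares principle states that $b_1$, $b_2$, and $b_3$ should be chosen to minimize
the residual sum of squares defined in Eq. (3-61). The necessary condition is that

$$\frac{\partial(\text{RSS})}{\partial b_1} = \frac{\partial(\text{RSS})}{\partial b_2} = \frac{\partial(\text{RSS})}{\partial b_3} = 0$$
This gives the normal equations
$$\begin{aligned}
\sum Y &= n b_1 + b_2 \sum X_2 + b_3 \sum X_3 \\
\sum X_2 Y &= b_1 \sum X_2 + b_2 \sum X_2^2 + b_3 \sum X_2 X_3 \\
\sum X_3 Y &= b_1 \sum X_3 + b_2 \sum X_2 X_3 + b_3 \sum X_3^2
\end{aligned} \tag{3-62}$$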


All the summations merely involve the sample values of $X_2$, $X_3$, and Y, so Eq.
(3-62) defines a set of three simultaneous equations which can be solved for the
least-squares estimates $b_1$, $b_2$, and $b_3$.† These equations parallel those for the
two-variable case in Eq. (2-13), and the derivation is the same as that shown in
the footnote below Eq. (2-13). The resultant least-squares regression indicates a
plane in three-dimensional space, and is written as

$$\hat{Y}_t = b_1 + b_2 X_{2t} + b_3 X_{3t} \qquad t = 1, 2, \ldots, n \tag{3-63}$$
or, equivalently,
$$Y_t = \hat{Y}_t + e_t \qquad t = 1, 2, \ldots, n \tag{3-64}$$

Dividing through the first equation in Eqs. (3-62) by the sample size n gives
$$\bar{Y} = b_1 + b_2 \bar{X}_2 + b_3 \bar{X}_3 \tag{3-65}$$

† Again, to keep the notation simple, we are not distinguishing between $b_i$ as a variable parameter,
as in Eq. (3-60), and $b_i$ as the least-squares estimator, defined in Eq. (3-62).
Thus the regression plane passes through the point of means. Summing Eqs.
(3-63) over the sample observations and dividing by n gives

$$\bar{\hat{Y}} = b_1 + b_2 \bar{X}_2 + b_3 \bar{X}_3$$
Comparison with Eq. (3-65) shows directly that

$$\bar{\hat{Y}} = \bar{Y}$$

and so from Eq. (3-64),
$$\bar{e} = 0 \tag{3-66}$$
Thus the mean of the regression values of Y is equal to the mean of the actual
sample values or, in other words, the sum of the least-squares residuals is
identically zero.
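Result (3-66) may also be obtained directly from the condition

$$\frac{\partial(\text{RSS})}{\partial b_1} = 0$$

for

$$\frac{\partial(\text{RSS})}{\partial b_1} = -2\sum_{t=1}^{n} (Y_t - b_1 - b_2 X_{2t} - b_3 X_{3t}) = -2\sum_{t=1}^{n} e_t = 0$$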


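The equality of the other two partial derivatives to zero gives the important result
that the least-squares residuals are uncorrelated with $X_2$, $X_3$, and $\hat{Y}$. For

$$\frac{\partial(\text{RSS})}{\partial b_2} = -2\sum X_{2t} e_t = 0 \tag{3-67}$$
and
$$\frac{\partial(\text{RSS})}{\partial b_3} = -2\sum X_{3t} e_t = 0 \tag{3-68}$$
Further
$$\sum e_t \hat{Y}_t = b_1 \sum e_t + b_2 \sum X_{2t} e_t + b_3 \sum X_{3t} e_t = 0 \tag{3-69}$$

since each term on the right-hand side is zero.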
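
The normal equations and the residual properties derived from them can be verified numerically. The following sketch uses randomly generated data with hypothetical coefficients, purely for illustration; it builds the system (3-62) term by term, solves it, and evaluates the four sums that should vanish.

```python
import numpy as np

rng = np.random.default_rng(0)            # synthetic data, illustration only
n = 50
X2 = rng.normal(10.0, 2.0, n)
X3 = rng.normal(5.0, 1.0, n)
Y = 1.0 + 0.5 * X2 - 2.0 * X3 + rng.normal(0.0, 1.0, n)

# Coefficient matrix and right-hand side of the normal equations (3-62).
A = np.array([
    [n,        X2.sum(),        X3.sum()],
    [X2.sum(), (X2**2).sum(),   (X2 * X3).sum()],
    [X3.sum(), (X2 * X3).sum(), (X3**2).sum()],
])
rhs = np.array([Y.sum(), (X2 * Y).sum(), (X3 * Y).sum()])
b1, b2, b3 = np.linalg.solve(A, rhs)

Y_hat = b1 + b2 * X2 + b3 * X3            # fitted values, Eq. (3-63)
e = Y - Y_hat                             # residuals, Eq. (3-64)

# Sums that should be zero by Eqs. (3-66) to (3-69), up to rounding error.
checks = [e.sum(), (X2 * e).sum(), (X3 * e).sum(), (Y_hat * e).sum()]
```

Each entry of `checks` should be zero apart from floating-point rounding.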

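Returning to Eq. (3-64) we can write it as

$$y = \hat{y} + e \tag{3-70}$$

where $y = Y - \bar{Y}$ and $\hat{y} = \hat{Y} - \bar{Y}$. Squaring and summing over all observations
gives

$$\sum y^2 = \sum \hat{y}^2 + \sum e^2 \tag{3-71}$$
since from Eq. (3-69)
$$\sum e\hat{Y} = \sum e(\hat{y} + \bar{Y}) = \sum e\hat{y} = 0$$
Relation (3-71) is the familiar decomposition of the total sum of squares
$$\text{TSS} = \text{ESS} + \text{RSS}$$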
and it suggests the definition of the multiple correlation coefficient R as

$$R^2 = \frac{\text{ESS}}{\text{TSS}} = \frac{\sum \hat{y}^2}{\sum y^2} = 1 - \frac{\sum e^2}{\sum y^2} \tag{3-72}$$
This coefficient is sometimes written as $R_{1.23}$, indicating that it relates to a
regression where $X_2$ and $X_3$ are the explanatory variables. Similarly, for explanatory
variables $X_2, X_3, \ldots, X_k$ it would be written $R_{1.23 \ldots k}$, but in most cases there
is no ambiguity and the subscripts are not inserted.
It is illuminating to note that $R^2$ is also equal to the square of the simple
correlation coefficient between Y and $\hat{Y}$. From Eq. (2-20) the latter coefficient
may be written

$$r_{Y\hat{Y}}^2 = \frac{\left(\sum y\hat{y}\right)^2}{\left(\sum y^2\right)\left(\sum \hat{y}^2\right)}$$
But
$$\sum y\hat{y} = \sum \hat{y}(\hat{y} + e) = \sum \hat{y}^2$$
Thus
$$r_{Y\hat{Y}}^2 = \frac{\sum \hat{y}^2}{\sum y^2} = R^2$$

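The normal equations in Eqs. (3-62) may also be simplified if expressed in
deviation form. Substituting $\bar{Y} = b_1 + b_2 \bar{X}_2 + b_3 \bar{X}_3$ in the second and third
equations gives
$$\begin{aligned}
\sum x_2 y &= b_2 \sum x_2^2 + b_3 \sum x_2 x_3 \\
\sum x_3 y &= b_2 \sum x_2 x_3 + b_3 \sum x_3^2
\end{aligned} \tag{3-73}$$
The calculation of ESS, RSS, and $R^2$ is then easily made in terms of deviations:
$$\begin{aligned}
\text{ESS} &= \sum \hat{y}^2 \\
&= \sum \hat{y}(b_2 x_2 + b_3 x_3) \\
&= b_2 \sum x_2 \hat{y} + b_3 \sum x_3 \hat{y} \\
&= b_2 \sum x_2 y + b_3 \sum x_3 y
\end{aligned} \tag{3-74}$$

Thus once $b_2$ and $b_3$ have been obtained from Eqs. (3-73), ESS is easily computed
from Eq. (3-74), and finally

$$\text{RSS} = \sum e^2 = \sum y^2 - \left(b_2 \sum x_2 y + b_3 \sum x_3 y\right) \tag{3-75}$$
and
$$R^2 = \frac{\sum \hat{y}^2}{\sum y^2}$$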
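
The deviation-form calculation can be sketched as follows. The data are synthetic and the coefficients hypothetical; the point is only to show the sequence of steps from Eq. (3-73) to $R^2$ and to confirm that $R^2$ equals the squared simple correlation between Y and the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)            # synthetic data, illustration only
n = 40
X2 = rng.normal(size=n)
X3 = 0.6 * X2 + rng.normal(size=n)
Y = 2.0 + 1.5 * X2 - 1.0 * X3 + rng.normal(size=n)

# Deviations from sample means.
y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()

# Two normal equations in deviation form, Eq. (3-73).
M = np.array([[(x2**2).sum(),   (x2 * x3).sum()],
              [(x2 * x3).sum(), (x3**2).sum()]])
b2, b3 = np.linalg.solve(M, [(x2 * y).sum(), (x3 * y).sum()])
b1 = Y.mean() - b2 * X2.mean() - b3 * X3.mean()    # from Eq. (3-65)

TSS = (y**2).sum()
ESS = b2 * (x2 * y).sum() + b3 * (x3 * y).sum()    # Eq. (3-74)
RSS = TSS - ESS                                    # Eq. (3-75)
R2 = ESS / TSS

# R^2 also equals the squared simple correlation between Y and fitted Y.
Y_hat = b1 + b2 * X2 + b3 * X3
r2_Y_Yhat = np.corrcoef(Y, Y_hat)[0, 1] ** 2
```

Working in deviations reduces the system from three equations to two, which is what makes hand calculation manageable.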


The exposition so far has been slanted toward the calculation of estimates by 
hand or on a desk calculator. This may seem uncalled for in an era when there is 
almost universal access to large-scale computers, but there is no one more
dangerous than the unthinking user of a computer, who has no real understanding 
of the nature of the computations being processed inside the machine. The one 
sure way to that understanding is to process some fairly simple examples 
thoroughly and carefully. 


Example 3-3 Figure 3-12 shows a scatter diagram of $w_t$ versus $u_t$ for the first
16 quarters of the data, that is, from 1954:2 to 1958:1. It is suggestive of a
nonlinear negative relationship.† Fig. 3-13 shows the scatter of $w_t$ plotted
against the reciprocal of unemployment. The slope is now positive and the
scatter approximately linear. The simple regression fitted to these data gives

$$w_t = -9.7530 + 62.9450\left(\frac{1}{u_t}\right) \qquad \text{with } r^2 = 0.8229$$
Computing the same regression for all 26 quarters up to 1960:3 gives
$$w_t = -1.4486 + 26.2904\left(\frac{1}{u_t}\right) \qquad \text{with } r^2 = 0.4235$$


† The nonlinearity is, in fact, very slight. A linear regression of $w_t$ on $u_t$ has an $r^2 = 0.8192$, which
is negligibly smaller than the $r^2$ of 0.8229 for the linear regression of $w_t$ on $1/u_t$.
We see that the coefficients change substantially and the fit to the longer
period is much worse than that to the shorter period. This is just the first


and testing for structural changes in the hypothesized relationship. 

Perry introduced the lagged inflation rate as an additional explanatory
variable, assuming it to be a good proxy for the expected rate of inflation.
The resultant multiple regression for the early period 1954:2 to 1958:1 is

$$w_t = -9.7396 + 62.9205\left(\frac{1}{u_t}\right) - 0.0054\,p_{t-1} \qquad \text{with } R^2 = 0.8230$$
and for the complete sample period it is
$$w_t = -1.7220 + 25.3698\left(\frac{1}{u_t}\right) + 0.3069\,p_{t-1} \qquad \text{with } R^2 = 0.5197$$

We see that for the early period the addition of the lagged inflation rate gives
no improvement in the regression: the square of the multiple correlation
coefficient is identical to the third decimal place with the square of the simple
correlation coefficient, and the coefficient of the inflation rate is negligible in
size and perversely signed. For the complete period, the addition of the




inflation rate yields some improvement in an initially rather unsatisfactory 
equation, and the coefficient of the lagged inflation rate is now correctly 
signed and no longer numerically negligible. We are not yet in a position to 
apply inference procedures to a multiple regression equation, but this exam- 
ple should alert us to the possibility of substantial variations in estimated 
relationships as a data base is expanded or contracted. 


The b's are point estimates of the hypothetical $\beta$'s, and all the usual inference
questions arise, such as those discussed in Chap. 2. There is little point in working 
out these questions explicitly for the three-variable case, for in Chap. 5 we will 
develop all the required inference procedures for the general case of k variables. 
However, before leaving the three-variable case there are some additional alge- 
braic relations to develop, which will also contribute to our understanding of 
more complicated relationships later. 

With three interrelated variables Y, $X_2$, and $X_3$ there are three simple
correlation coefficients denoted by

$$r_{12}, \quad r_{13}, \quad \text{and} \quad r_{23}$$

where the subscript 1 refers to the Y variable. The techniques of Chap. 2 would
also enable us to compute various regression slopes such as

$b_{12}$ = slope of regression of Y on $X_2$
$b_{13}$ = slope of regression of Y on $X_3$
$b_{23}$ = slope of regression of $X_2$ on $X_3$

and there are, of course, three further regression slopes where the order of the
subscripts on b is reversed.† The first question to explore is the connection
between $b_2$ and $b_3$, the slopes of the regression plane, and the simple b's, the
slopes of the various two-variable scatters. Solving Eq. (3-73) for $b_2$ gives

$$b_2 = \frac{\sum x_2 y \sum x_3^2 - \sum x_3 y \sum x_2 x_3}{\sum x_2^2 \sum x_3^2 - \left(\sum x_2 x_3\right)^2}$$

Dividing top and bottom by $\sum x_2^2 \sum x_3^2$ then gives‡

$$b_2 = \frac{b_{12} - b_{13} b_{32}}{1 - b_{23} b_{32}} = \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2} \cdot \frac{s_1}{s_2} \tag{3-76}$$
Similarly,
$$b_3 = \frac{b_{13} - b_{12} b_{23}}{1 - b_{23} b_{32}} = \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} \cdot \frac{s_1}{s_3} \tag{3-77}$$

† In the regression of $X_2$ on $X_3$ the residuals are measured in the $X_2$ direction, while in the regression
of $X_3$ on $X_2$ the residuals are measured in the $X_3$ direction.

‡ This follows directly by applying Eqs. (2-14a) and (2-19), that is,

$$b_{12} = \frac{\sum y x_2}{\sum x_2^2} = r_{12}\frac{s_1}{s_2} \qquad b_{32} = \frac{\sum x_2 x_3}{\sum x_2^2} = r_{23}\frac{s_3}{s_2}$$

and so on.
Thus the coefficient of $X_2$ in the multiple regression can be regarded as being
based on the coefficient in the simple regression of Y on $X_2$, subject to a
correction for the presence of $X_3$. If it should happen that $X_2$ and $X_3$ are
uncorrelated in the sample, that is, $\sum x_2 x_3 = 0$, then $b_{23} = b_{32} = 0$, and the
correction factor is zero, so that
$$b_2 = b_{12} \qquad b_3 = b_{13}$$

The explanatory variables in this case are said to be orthogonal, and the multiple
regression coefficients coincide with the simple regression coefficients. An older,
but expressive notation for $b_2$ and $b_3$ is $b_{12.3}$ and $b_{13.2}$, respectively. A coefficient
like $b_{12}$ is then described as a zero-order coefficient and $b_{12.3}$ as a first-order
coefficient, the number after the decimal point in the subscript indicating the one
other explanatory variable present in this regression. Equations (3-76) and (3-77)
thus indicate the connection between zero-order and first-order regression coefficients.
It is natural to inquire if there is a similar hierarchy of correlation
coefficients. The simple coefficients $r_{12}$, $r_{13}$, ... are defined as zero-order correlation
coefficients. What then is the definition of a first-order correlation coefficient, and
what meaning attaches to it?
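
The link between the simple slopes and the multiple-regression slopes in Eqs. (3-76) and (3-77) can be confirmed on artificial data. The sketch below assumes variables already expressed as deviations from their means; the data-generating coefficients and the helper name `slope` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)            # synthetic data, illustration only
n = 60
x3 = rng.normal(size=n)
x2 = 0.5 * x3 + rng.normal(size=n)
y = 1.0 * x2 + 2.0 * x3 + rng.normal(size=n)
y, x2, x3 = y - y.mean(), x2 - x2.mean(), x3 - x3.mean()

def slope(dep, reg):
    """Simple (zero-order) regression slope of dep on reg, in deviations."""
    return (dep * reg).sum() / (reg**2).sum()

b12, b13 = slope(y, x2), slope(y, x3)
b23, b32 = slope(x2, x3), slope(x3, x2)

# Multiple-regression slopes from the simple slopes, Eqs. (3-76) and (3-77).
b2 = (b12 - b13 * b32) / (1 - b23 * b32)
b3 = (b13 - b12 * b23) / (1 - b23 * b32)

# Direct least-squares fit for comparison.
b_direct, *_ = np.linalg.lstsq(np.column_stack([x2, x3]), y, rcond=None)
```

The pair `(b2, b3)` should agree with `b_direct` up to rounding error.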

This question may be approached in the following way. Suppose $X_2$ influences
Y, but $X_3$ also influences Y. The simple correlation coefficient between Y and $X_2$
is, by definition,

$$r_{12} = \frac{\sum y x_2}{\sqrt{\sum y^2}\sqrt{\sum x_2^2}}$$

Part of the variation in Y is really due to $X_3$, and so $r_{12}$ will not correctly measure
the correlation attributable to $X_2$. If one ran a linear regression of Y on $X_3$, the
residuals are given by

$$y - b_{13} x_3$$
This series represents the variation in Y left over after the linear effect of $X_3$ has
been removed. Correlating these residuals with $X_2$ gives

$$\frac{\sum (y - b_{13} x_3) x_2}{\sqrt{\sum (y - b_{13} x_3)^2}\sqrt{\sum x_2^2}}$$

If $X_2$ and $X_3$ were uncorrelated in the sample, this coefficient simplifies to

$$\frac{\sum y x_2}{\sqrt{\sum (y - b_{13} x_3)^2}\sqrt{\sum x_2^2}}$$

which will be numerically greater than the simple coefficient $r_{12}$. However, $X_2$ and
$X_3$ are in general correlated, and the proper correction procedure is to remove the
linear effect of $X_3$ from both $X_2$ and Y and then to correlate the resulting
residuals. This gives

$$r_{12.3} = \frac{\sum (y - b_{13} x_3)(x_2 - b_{23} x_3)}{\sqrt{\sum (y - b_{13} x_3)^2}\sqrt{\sum (x_2 - b_{23} x_3)^2}} \tag{3-78}$$
Equation (3-78) defines a first-order correlation coefficient. It measures the
correlation remaining between Y and $X_2$ after the linear effect of $X_3$ has been
removed from each. It is more generally known as a partial correlation coefficient.
Equation (3-78) simplifies to†

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}} \tag{3-79}$$
Similarly,
$$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{1 - r_{12}^2}\sqrt{1 - r_{23}^2}} \tag{3-80}$$

This latter expression may also be derived from first principles by finding the
simple correlation between

$$(y - b_{12} x_2) \quad \text{and} \quad (x_3 - b_{32} x_2)$$

or, more simply, by interchanging subscripts 2 and 3 in Eq. (3-79). A partial
correlation coefficient between any two variables thus measures the correlation
between the residuals left in each variable after the linear effect of all other
explanatory variables has been removed.
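
The equivalence of the residual definition (3-78) and the formula (3-79) can be illustrated with a short calculation. The data below are synthetic and the coefficients hypothetical; `corr` is a helper defined here for variables in deviation form.

```python
import numpy as np

rng = np.random.default_rng(3)            # synthetic data, illustration only
n = 80
x3 = rng.normal(size=n)
x2 = 0.7 * x3 + rng.normal(size=n)
y = 0.5 * x2 + 1.0 * x3 + rng.normal(size=n)
y, x2, x3 = y - y.mean(), x2 - x2.mean(), x3 - x3.mean()

def corr(u, v):
    """Simple correlation of two series measured as deviations from means."""
    return (u * v).sum() / np.sqrt((u**2).sum() * (v**2).sum())

# Residuals after removing the linear effect of X3 from Y and from X2.
b13 = (y * x3).sum() / (x3**2).sum()
b23 = (x2 * x3).sum() / (x3**2).sum()
r12_3_residuals = corr(y - b13 * x3, x2 - b23 * x3)      # Eq. (3-78)

# Same coefficient from the zero-order correlations, Eq. (3-79).
r12, r13, r23 = corr(y, x2), corr(y, x3), corr(x2, x3)
r12_3_formula = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
```

The two routes should give the same value up to rounding error.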

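The multiple regression slopes $b_2$ and $b_3$, defined in Eq. (3-73), may also be
interpreted in terms of these residuals. Regressing $(y - b_{13} x_3)$ on $(x_2 - b_{23} x_3)$
gives a regression slope of

$$\frac{\sum (y - b_{13} x_3)(x_2 - b_{23} x_3)}{\sum (x_2 - b_{23} x_3)^2} = \frac{\sum y (x_2 - b_{23} x_3)}{\sum x_2 (x_2 - b_{23} x_3)} \qquad \text{since } x_3 \text{ is uncorrelated with the residuals } (x_2 - b_{23} x_3)$$

$$= \frac{\sum y x_2 \sum x_3^2 - \sum y x_3 \sum x_2 x_3}{\sum x_2^2 \sum x_3^2 - \left(\sum x_2 x_3\right)^2}$$
$$= b_2$$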

† The explicit derivation is as follows. Using Eqs. (2-14a) and (2-20),

$$\begin{aligned}
\text{numerator of } r_{12.3} &= \sum y x_2 - b_{13} \sum x_2 x_3 - b_{23} \sum y x_3 + b_{13} b_{23} \sum x_3^2 \\
&= n r_{12} s_1 s_2 - r_{13} \frac{s_1}{s_3}\, n r_{23} s_2 s_3 - r_{23} \frac{s_2}{s_3}\, n r_{13} s_1 s_3 + r_{13} r_{23} \frac{s_1 s_2}{s_3^2}\, n s_3^2 \\
&= n s_1 s_2 (r_{12} - r_{13} r_{23})
\end{aligned}$$

where $s_1$ denotes the sample standard deviation of the Y values, $s_2$ the sample standard deviation of
the $X_2$ values, and so on. In the denominator $\sum (y - b_{13} x_3)^2$ is the residual sum of squares in the
regression of Y on $X_3$. From Eq. (2-20) this may be written as $n s_1^2 (1 - r_{13}^2)$. Likewise $\sum (x_2 - b_{23} x_3)^2$
is the residual sum of squares in the regression of $X_2$ on $X_3$ and may be written as $n s_2^2 (1 - r_{23}^2)$. Thus
the denominator of Eq. (3-78) is $n s_1 s_2 \sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}$, and Eq. (3-79) follows.
from Eq. (3-73). Likewise, $b_3$ is just the simple regression slope obtained when the
residuals $(y - b_{12} x_2)$ are regressed on the residuals $(x_3 - b_{32} x_2)$.
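
The residual interpretation of the multiple regression slopes lends itself to a direct numerical check. This is a sketch on synthetic data with hypothetical coefficients; the helper `slope` assumes variables in deviation form.

```python
import numpy as np

rng = np.random.default_rng(4)            # synthetic data, illustration only
n = 70
x3 = rng.normal(size=n)
x2 = -0.4 * x3 + rng.normal(size=n)
y = 2.0 * x2 + 0.5 * x3 + rng.normal(size=n)
y, x2, x3 = y - y.mean(), x2 - x2.mean(), x3 - x3.mean()

def slope(dep, reg):
    """Simple regression slope of dep on reg, both in deviation form."""
    return (dep * reg).sum() / (reg**2).sum()

# Purge X3 from both Y and X2, then regress residuals on residuals.
y_res = y - slope(y, x3) * x3
x2_res = x2 - slope(x2, x3) * x3
b2_from_residuals = slope(y_res, x2_res)

# Purge X2 from both Y and X3 to obtain b3 in the same way.
b3_from_residuals = slope(y - slope(y, x2) * x2, x3 - slope(x3, x2) * x2)

# Multiple regression of y on x2 and x3 for comparison.
b_multiple, *_ = np.linalg.lstsq(np.column_stack([x2, x3]), y, rcond=None)
```

Both residual-based slopes should match the corresponding entries of `b_multiple`.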

Finally, the multiple correlation coefficient $R_{1.23}$ is a second-order correlation
coefficient, and it may also be expressed in various ways in terms of lower-order
coefficients. For example, using Eq. (3-74) gives

$$R_{1.23}^2 = \frac{b_2 \sum x_2 y + b_3 \sum x_3 y}{\sum y^2}$$
Substituting for $b_2$ and $b_3$ from Eqs. (3-76) and (3-77) gives

$$R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2} \tag{3-81}$$

The buildup of the explained sum of squares may also be looked at in sequential
fashion, and this is illuminating for the analysis of variance treatment in Chap. 5.
Suppose one first regressed Y on $X_2$. Then
$$r_{12}^2 \sum y^2 = \text{ESS due to the regression of } Y \text{ on } X_2$$
and
$$\sum y^2 (1 - r_{12}^2) = \text{RSS from the regression of } Y \text{ on } X_2$$

This latter quantity is the sum of squares still to be explained. Regressing the
residuals $(y - b_{12} x_2)$ on the $X_3$ residuals, namely, $(x_3 - b_{32} x_2)$, then gives

$$r_{13.2}^2 \sum y^2 (1 - r_{12}^2) = \text{ESS in the regression of the } Y \text{ residuals on the } X_3 \text{ residuals}$$
and

$$(1 - r_{13.2}^2) \sum y^2 (1 - r_{12}^2) = \text{RSS in the regression of the } Y \text{ residuals on the } X_3 \text{ residuals}$$

Aggregating the ESS at each stage gives a total explained sum of squares in Y as

$$\sum y^2 \left[ r_{12}^2 + r_{13.2}^2 (1 - r_{12}^2) \right] \tag{3-82}$$
Substituting for $r_{13.2}^2$ in Eq. (3-82) from Eq. (3-80) and using Eq. (3-81) gives
$$\sum y^2 \left[ r_{12}^2 + r_{13.2}^2 (1 - r_{12}^2) \right] = \sum y^2 R_{1.23}^2 \tag{3-83}$$

But $\sum y^2 R_{1.23}^2$, by definition, indicates the sum of squares in Y explained by the
multiple regression of Y on $X_2$ and $X_3$. Equation (3-83) thus shows that the
multiple regression ESS can be regarded as built up in two steps, first the ESS due
to the simple regression of Y on $X_2$ and second the ESS due to the simple
regression of the Y residuals on the $X_3$ residuals. The increment in the ESS due to
adding $X_3$ to the regression is
$$\sum y^2 \left( R_{1.23}^2 - r_{12}^2 \right)$$
which, from Eq. (3-83), may be written

$$r_{13.2}^2 \sum y^2 (1 - r_{12}^2)$$
where $\sum y^2 (1 - r_{12}^2)$ is the RSS after Y has been regressed on $X_2$. Thus $r_{13.2}^2$
indicates the proportion of the variation left after the simple regression on $X_2$,
which is explained by adding $X_3$ to the set of explanatory variables. A similar
development and interpretation may be made by starting with the simple regression
of Y on $X_3$ and then adding $X_2$ to the explanatory variables.
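
The two-step buildup of the explained sum of squares in Eqs. (3-82) and (3-83) can be traced numerically. The sketch below uses synthetic data in deviation form; the coefficients are hypothetical and serve only to produce correlated regressors.

```python
import numpy as np

rng = np.random.default_rng(5)            # synthetic data, illustration only
n = 100
x2 = rng.normal(size=n)
x3 = 0.3 * x2 + rng.normal(size=n)
y = 1.0 * x2 + 1.0 * x3 + rng.normal(size=n)
y, x2, x3 = y - y.mean(), x2 - x2.mean(), x3 - x3.mean()

def corr(u, v):
    """Simple correlation of two series measured as deviations from means."""
    return (u * v).sum() / np.sqrt((u**2).sum() * (v**2).sum())

r12, r13, r23 = corr(y, x2), corr(y, x3), corr(x2, x3)
r13_2 = (r13 - r12 * r23) / np.sqrt((1 - r12**2) * (1 - r23**2))   # (3-80)
R2 = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)       # (3-81)

TSS = (y**2).sum()
ess_step1 = r12**2 * TSS                      # Y on X2
ess_step2 = r13_2**2 * TSS * (1 - r12**2)     # Y residuals on X3 residuals
ess_total = ess_step1 + ess_step2             # Eq. (3-82)
```

The total from the two steps should equal `R2 * TSS`, which is the content of Eq. (3-83).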


Example 3-4 Denote the variables in Table 3-7 by

1 = $w_t$
2 = $u_t^{-1}$
3 = $p_{t-1}$

Table 3-7 Wage change, unemployment, and price change in the United States†

Year   Quarter    w_t     1/u_t    p_(t-1)
1954      2       3.53    0.2312    1.229
          3       1.74    0.1942    0.702
          4       1.72    0.1794    0.000
1955      1       1.71    0.1827   -0.522
          2       2.27    0.1942   -0.260
          3       4.57    0.2128    0.173
          4       4.52    0.2260    0.090
1956      1       5.06    0.2353    0.699
          2       5.56    0.2367    0.263
          3       4.37    0.2367    1.048
          4       5.95    0.2381    1.997
1957      1       6.42    0.2395    2.507
          2       5.26    0.2424    3.447
          3       5.24    0.2410    3.589
          4       4.08    0.2286    3.376
1958      1       3.52    0.2010    3.023
          2       3.50    0.1747    3.415
          3       3.48    0.1544    3.221
          4       2.94    0.1460    2.297
1959      1       3.88    0.1487    1.966
          2       4.35    0.1613    0.814
          3       2.88    0.1747    0.404
          4       3.33    0.1810    0.967
1960      1       3.24    0.1878    1.367
          2       2.78    0.1869    1.528
          3       3.74    0.1869    1.762

† w = four-quarter percentage change in hourly earnings of production workers in all manufacturing
u = unemployment as a percentage of the civilian labor force
p = four-quarter percentage change in the consumer price index (CPI)
Source: G. Perry, Unemployment, Money Wage Rates and Inflation, MIT Press, Cambridge,
MA, 1966.
For the complete sample period the zero-order correlation coefficients are
$$r_{12} = 0.6508 \qquad r_{13} = 0.3567 \qquad r_{23} = 0.0726$$

$$r_{12}^2 = 0.4235 \qquad r_{13}^2 = 0.1272$$

$$r_{12.3} = \frac{0.6508 - 0.3567(0.0726)}{\sqrt{1 - (0.3567)^2}\sqrt{1 - (0.0726)^2}} = 0.6707$$

$$r_{13.2} = \frac{0.3567 - 0.6508(0.0726)}{\sqrt{1 - (0.6508)^2}\sqrt{1 - (0.0726)^2}} = 0.4087$$

$$\text{and} \qquad r_{12.3}^2 = 0.4498 \qquad r_{13.2}^2 = 0.1670$$

and $R_{1.23}^2 = 0.5197$. In this example the two explanatory variables are practically
uncorrelated, so there is little difference between the zero-order and the
first-order correlation coefficients. Unemployment alone explains over 42
percent of the variation in wage change, price change alone accounts for
about 13 percent, and the two variables jointly account for about 52 percent.
Unemployment accounts for 45 percent of the variation unexplained by price,
and price accounts for about 17 percent of the variation unexplained by
unemployment.
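
The arithmetic of this example can be reproduced from the three zero-order correlations alone. The following sketch simply applies Eqs. (3-79), (3-80), and (3-81) to the reported values; the variable names are chosen here for illustration.

```python
import math

# Zero-order correlations reported in Example 3-4
# (1 = wage change, 2 = reciprocal of unemployment, 3 = lagged inflation).
r12, r13, r23 = 0.6508, 0.3567, 0.0726

r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))  # (3-79)
r13_2 = (r13 - r12 * r23) / math.sqrt((1 - r12**2) * (1 - r23**2))  # (3-80)
R2 = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)         # (3-81)

print(round(r12_3**2, 4), round(r13_2**2, 4), round(R2, 4))
```

Because the inputs are themselves rounded to four decimals, the computed values can differ from the figures in the example in the last decimal place.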


PROBLEMS 


3-1 Given five observations $u_{-2}$, $u_{-1}$, $u_0$, $u_1$, and $u_2$ at equally spaced points of time t =
−2, −1, 0, 1, 2, show how to fit a parabola to the observations by least squares and show that the
value given by the parabola at time t = 0 is

$$\tfrac{1}{35}\left(-3u_{-2} + 12u_{-1} + 17u_0 + 12u_1 - 3u_2\right)$$

(R.S.S. Certificate, 1955)
3-2 The “firmness” of cheese depends upon the time allowed for a certain process in the manufacture.
In an experiment on this topic, 18 cheeses were taken, and at each of several times firmness was
determined on samples from three of the cheeses. The results (on an arbitrary scale) are given below:

Time, h    Firmness
  1/2      102   105   115
  1        110   120   115
  1 1/2    126   128   119
  2        132   143   139
  3        160   149   147
  4        164   166   172

Estimate the parameters in a linear regression of firmness on time. Give standard errors of the
estimates and test the adequacy of a linear regression to describe the results.
(R.S.S. Certificate, 1955)


3-3 Discuss briefly the advantages and disadvantages of the relation
$$v_i = \alpha + \beta \log v_0$$

as a representation of an Engel curve, where $v_i$ is expenditure per person on commodity i and $v_0$ is
income per person. Fit such a curve to the following data, and from your results estimate the income
elasticity at an income of £5 per week.


Pounds per week 


0; 0.8 12 ifSs See be) 2.6 3.1 
oy LT) in 2Tiaicdiben BP 5 his Os tireiy S-b 12.0 


Does it make any difference to your estimate of the income elasticity if the logarithms of $v_0$ are
taken to base 10 or to base e? Explain carefully.
(Manchester University, 1956)


3-4 Response rates at various levels of ratable values.

Range of ratable value               A    B    C    D    E    F    G    H    I    J
Assumed central value X, £/annum     3    7   12   17   25   35   45   55   70  120
Response rate Y, percent            86   79   76   69   65   62   52   51   51   48

The data relate to a survey recently conducted in England. Estimate the constants in the regression
equation
$$\frac{100}{100 - Y} = a + \frac{b}{X}$$
(Oxford University, 1955)
3-5 $X_1$, $X_2$, and $X_3$ are three correlated variables, where $s_1 = 1$, $s_2 = 1.3$, $s_3 = 1.9$ and $r_{12} = 0.370$,
$r_{13} = -0.641$, and $r_{23} = -0.736$. Compute $r_{13.2}$. If $X_4 = X_1 + X_2$, obtain $r_{42}$, $r_{43}$, and $r_{43.2}$. Verify
that the two partial coefficients are equal and explain this result.
(UL, 1952)
3-6 In the regression equation
$$y = \beta x_1 + \gamma x_2 + \epsilon$$

all variables are expressed as deviations from their sample means. Consider the following alternative
procedures for estimating $\beta$.

(a) Calculate the estimates $\hat{\beta}$ and $\hat{\gamma}$ in a regression of y on $x_1$ and $x_2$.

(b) Regress y on $x_2$ and calculate the regression residuals $y^*$; regress $x_1$ on $x_2$ and calculate the
regression residuals $x_1^*$; regress $y^*$ on $x_1^*$ to obtain an estimate b of $\beta$.

Show that the two procedures give the same result, that is, $\hat{\beta} = b$.

Show that the regression residuals given by each procedure, that is,

$$(y - \hat{\beta} x_1 - \hat{\gamma} x_2) \quad \text{and} \quad (y^* - b x_1^*)$$

are the same.
(UL, 1969)




3-7 Outline the properties of the following functions, and sketch their graphs: 
(a) y=a + Bln x 


ia siereaers 
eat hx 
(cy 1+ ett Be 


Find transformations which linearize functions (b) and (c), that is, for each function find a pair of
transformations f(x) and g(y) such that g(y) is a linear function of f(x) and the $\alpha$, $\beta$ parameters
may be estimated.

3-8 Your research assistant reports the following results in several different regression problems. In 
which cases could you be certain that an error had been committed? Explain. 

(a) $R_{1.23}^2 = 0.89$ and $R_{1.234}^2 = 0.86$

(b) $r_{12}^2 = 0.227$, $r_{13}^2 = 0.126$, and $R_{1.23}^2 = 0.701$

(c) $\left(\sum x^2\right)\left(\sum y^2\right) - \left(\sum xy\right)^2 = -1{,}732.86$

(University of Michigan, 1980) 
3-9 Sometimes variables are standardized before the computation of regression and correlation 
coefficients. Standardization is achieved by dividing each observation on a variable by its standard 
deviation, so that the standard deviation of the transformed variable is unity. If the original relation is, 
say, 


$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$$
and the corresponding relation between the transformed variables is
$$Y^* = \beta_1^* + \beta_2^* X_2^* + \beta_3^* X_3^* + u^*$$
where
$$Y^* = Y/s_1 \qquad X_i^* = X_i/s_i \quad (i = 2, 3)$$

what is the relationship between $\beta_2^*$, $\beta_3^*$ and $\beta_2$, $\beta_3$? Show that the partial correlation coefficients are
unaffected by the transformation.


CHAPTER 


FOUR 
ELEMENTS OF MATRIX ALGEBRA 


It is clear from the last section of Chap. 3 that it would be excessively tedious and 
complicated to build up to the general case of k-variable regression in a stepwise 
fashion. Fortunately, by the use of matrix algebra we have a compact and 
powerful way of treating the problem, and we shall see that the detailed results of 
the previous two chapters are merely special cases of a few simple matrix 
formulas. The rest of this chapter presents the elements of matrix algebra that are 
necessary for following the treatment in the remainder of the book. Most or all of 
this chapter may be skipped by those with adequate previous knowledge of 
matrices. For those whose knowledge is somewhat rusty it may hopefully serve as 
a useful review, but every attempt has been made to make the material accessible 
to a student with no prior knowledge of matrices, for the subject is so fundamen- 
tal to modern econometrics (and economics) that no serious student can afford to 
be without it.†
Suppose our theory suggests that a dependent (explained) variable Y is a
linear function of $k - 1$ independent (explanatory) variables $X_2, X_3, \ldots, X_k$.

† Nonetheless students with no prior knowledge of matrix algebra are likely to get indigestion if
they attempt to work through all the material in this chapter before proceeding with the rest of the
book. The topics are introduced approximately in the order in which they will appear in subsequent
chapters. Thus the students should interact between this chapter and those that follow, learning
enough from Chap. 4 to proceed with Chaps. 5, 6, and so on, and returning to Chap. 4 as necessary.
Summaries of results have also been inserted at various stages in the chapter.
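Allowing for a constant term, we would write the function as
$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + u$$
If we have n sample observations, the model gives rise to the following set of n
equations:
$$\begin{aligned}
Y_1 &= \beta_1 + \beta_2 X_{21} + \beta_3 X_{31} + \cdots + \beta_k X_{k1} + u_1 \\
Y_2 &= \beta_1 + \beta_2 X_{22} + \beta_3 X_{32} + \cdots + \beta_k X_{k2} + u_2 \\
&\;\;\vdots \\
Y_n &= \beta_1 + \beta_2 X_{2n} + \beta_3 X_{3n} + \cdots + \beta_k X_{kn} + u_n
\end{aligned} \tag{4-1}$$
As a first step we rewrite these n equations in the form
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix}
1 & X_{21} & X_{31} & \cdots & X_{k1} \\
1 & X_{22} & X_{32} & \cdots & X_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{2n} & X_{3n} & \cdots & X_{kn}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} +
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \tag{4-2}$$
or
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \tag{4-3}$$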


where the four boldface symbols correspond to the four sets of elements that have
been enclosed in square brackets in Eq. (4-2). These symbols indicate vectors and
matrices. For example,

$$\mathbf{y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}$$

is an n-element column vector, with the sample observations on Y arranged in a
specific order in the form of a column. Likewise $\boldsymbol{\beta}$ is a k-element column vector,
containing the coefficients of the hypothesized relation, and $\mathbf{u}$ is an n-element
column vector, containing the n unknown disturbances.

$$\mathbf{X} = \begin{bmatrix}
1 & X_{21} & X_{31} & \cdots & X_{k1} \\
1 & X_{22} & X_{32} & \cdots & X_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{2n} & X_{3n} & \cdots & X_{kn}
\end{bmatrix}$$

is a matrix with n rows and k columns. A matrix is simply a rectangular array of
elements, and we say that $\mathbf{X}$ is a matrix of order $n \times k$ to indicate that the
rectangular array in $\mathbf{X}$ has n rows and k columns. In stating the order of a matrix,
the number of rows is always given first and the number of columns second.
Clearly, a column vector is merely a special case of a matrix, namely, a matrix
with only one column. Likewise, a row vector such as

$$\begin{bmatrix} 1 & X_{21} & X_{31} & \cdots & X_{k1} \end{bmatrix}$$
is another special case, namely, a matrix with just one row.† We may thus look at
the $\mathbf{X}$ matrix in two ways, as an ordered collection of column vectors or as an
ordered collection of row vectors. Each column vector, apart from the first,
denotes the sample observations on a particular explanatory variable. Thus, for
example,
$$\mathbf{x}_3 = \begin{bmatrix} X_{31} \\ X_{32} \\ \vdots \\ X_{3n} \end{bmatrix}$$
denotes the sample observations on the variable $X_3$. The first column is a
collection of units and, as we will see, is required in order to incorporate the
intercept $\beta_1$ into the regression. Using this notation for the column vectors, we
could express $\mathbf{X}$ as
$$\mathbf{X} = \left[\begin{array}{c|c|c|c} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k \end{array}\right] \tag{4-4}$$
where each $\mathbf{x}_i$ is an n-element column vector and $\mathbf{x}_1$ indicates a column of units.
The rows of $\mathbf{X}$ indicate observations on all explanatory variables at a particular
sample point plus a unit in the first position. Thus if the data are in time series
form, the first row indicates the X values in the first time period, the second row
the values in the second time period, and so forth. Thus using $\mathbf{s}_i$ to indicate the
vector of observations at the ith sample point, the $\mathbf{X}$ matrix may also be expressed
as‡

$$\mathbf{X} = \left[\begin{array}{c} \mathbf{s}_1 \\ \hline \mathbf{s}_2 \\ \hline \vdots \\ \hline \mathbf{s}_n \end{array}\right] \tag{4-5}$$

We have inserted vertical and horizontal lines in Eqs. (4-4) and (4-5) to emphasize
that the first is a representation of $\mathbf{X}$ in terms of column vectors and the second a
representation in terms of row vectors. In practice one usually writes these
expressions more compactly as

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k \end{bmatrix}$$

it being clear from the context which are row and which are column vectors.
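
The equivalence between the n scalar equations (4-1) and the matrix statement (4-3) can be illustrated with a small array calculation. The numbers below are hypothetical and serve only to make the shapes concrete; NumPy is used here as a convenient stand-in for the matrix algebra.

```python
import numpy as np

# Hypothetical sample: n = 4 observations on two explanatory variables.
X2 = np.array([2.0, 3.0, 5.0, 7.0])
X3 = np.array([1.0, 0.0, 4.0, 2.0])
beta = np.array([1.0, 0.5, -2.0])          # (beta_1, beta_2, beta_3)
u = np.array([0.1, -0.2, 0.0, 0.3])        # disturbances

# X has a leading column of units so that beta_1 acts as the intercept.
X = np.column_stack([np.ones_like(X2), X2, X3])   # order n x k = 4 x 3

y = X @ beta + u                            # Eq. (4-3)

# The same n equations written out one at a time, as in Eq. (4-1).
y_scalar = np.array([beta[0] + beta[1] * X2[t] + beta[2] * X3[t] + u[t]
                     for t in range(len(X2))])

first_row = X[0, :]      # observations at the first sample point
second_col = X[:, 1]     # all n observations on X2
```

Slicing a row of `X` gives the values at one sample point, and slicing a column gives all observations on one variable, which mirrors the two views of the matrix described above.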


† We will adhere to the convention of indicating a matrix by an uppercase boldface letter and a
vector by a lowercase boldface letter.
‡ We have indicated the row vectors by the letter s since they correspond to sample points. A more
common notation is to let $\mathbf{x}_{i\cdot}$ indicate the ith row of the $\mathbf{X}$ matrix and $\mathbf{x}_{\cdot j}$ the jth column.
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Equations (4-2) and (4-3) are equivalent ways of stating Eq. (4-1). For this to 
be so, operations on matrices must follow certain simple rules, which we will now 
describe. ft 


4-1 OPERATIONS ON VECTORS AND MATRICES 


The right-hand side of Eq. (4-3) indicates two elementary operations on matrices, 
namely, multiplication and addition. 


Matrix Multiplication 


Matrix multiplication is achieved by repeated applications of vector multiplica- 
tion. The multiplication of an n-element row vector into an n-element column 
vector is defined as follows: 


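\[ \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{i=1}^{n} a_i b_i \tag{4-6} \]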


that is, corresponding elements are multiplied together and the results summed. 
As a numerical example, 


\[ \begin{bmatrix} 2 & 3 & -1 \end{bmatrix} \begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix} = 2(1) + 3(4) + (-1)(5) = 9 \]


As the definition and the example show, multiplying a 1 × n vector into an n × 1 
vector produces a 1 × 1 vector, or a scalar quantity. Notice that the operation is 
not defined if the number of elements in the two vectors is not the same. It is also 
clear from the definition that 


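\[ \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \]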
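The definition in Eq. (4-6) translates directly into a few lines of code. The following is a minimal Python sketch; the function name `inner` is our own label, not notation from the text, and vectors are assumed to be stored as plain lists.

```python
def inner(a, b):
    """Inner product of two equal-length vectors, as in Eq. (4-6)."""
    if len(a) != len(b):
        # The operation is not defined for vectors of different lengths.
        raise ValueError("vectors must have the same number of elements")
    # Multiply corresponding elements and sum the results.
    return sum(ai * bi for ai, bi in zip(a, b))


a = [2, 3, -1]
b = [1, 4, 5]
print(inner(a, b))                  # 2(1) + 3(4) + (-1)(5) = 9
print(inner(a, b) == inner(b, a))   # the order of the two vectors does not matter
```

Raising an error for unequal lengths mirrors the remark that the product is undefined in that case.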
Suppose, however, we define the vectors a and b as column vectors, that is, 
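\[ \mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \qquad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \tag{4-7} \]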


The above multiplication definition cannot apply directly to a and b since they are 
both column vectors. We thus define an operation of transposition, which turns 
column vectors into row vectors, and vice versa. Thus 
\[ \text{Transpose of } \mathbf{a} = \mathbf{a}' = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \]


Transposition does not change any of the elements of the vector; they are merely 
written in the original order, but as a row instead of a column, or vice versa. It is 
also clear that transposing a’ gets us back to the original column vector. Thus 


\[ (\mathbf{a}')' = \mathbf{a} \]
Some writers use a superscript T to indicate a transpose, that is, $\mathbf{a}' = \mathbf{a}^T$, but we 
will use a prime. We will also sometimes find it useful to take the transpose of a 
1 × 1 vector, or scalar, and clearly, that leaves the scalar unchanged. 


Scalar, Dot, or Inner Product 


In general, unless we specifically state the contrary, a vector symbol will indicate a 
column vector, as in Eq. (4-7). The multiplication operation already defined in 
Eq. (4-6) then enables us to define the scalar, dot, or inner product of the two 
vectors a and b in Eq. (4-7) as 
\[ \mathbf{a}'\mathbf{b} = \mathbf{b}'\mathbf{a} = \sum_{i=1}^{n} a_i b_i \tag{4-8} \]
If we have two matrices A and B, where A is of order m X n and B is of order 
n X p, the product AB will be a matrix C of order m x p, that is, 
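\[ \underset{(m \times n)}{\mathbf{A}} \; \underset{(n \times p)}{\mathbf{B}} = \underset{(m \times p)}{\mathbf{C}} \tag{4-9} \]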
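If we indicate the rows of A by $\mathbf{a}_i'$ ($i = 1, \ldots, m$) and the columns of B by $\mathbf{b}_j$ 
($j = 1, \ldots, p$), each (scalar) element in C is the inner product of a row vector 
from A and a column vector from B, namely, 

\[ \mathbf{AB} = \begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_m' \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_p \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1'\mathbf{b}_1 & \mathbf{a}_1'\mathbf{b}_2 & \cdots & \mathbf{a}_1'\mathbf{b}_p \\ \mathbf{a}_2'\mathbf{b}_1 & \mathbf{a}_2'\mathbf{b}_2 & \cdots & \mathbf{a}_2'\mathbf{b}_p \\ \vdots & \vdots & & \vdots \\ \mathbf{a}_m'\mathbf{b}_1 & \mathbf{a}_m'\mathbf{b}_2 & \cdots & \mathbf{a}_m'\mathbf{b}_p \end{bmatrix} \tag{4-10} \]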


The basic rule embodied in Eq. (4-10) is 
Element in i, jth position in AB = inner product of row i of A and column j of B 


Example 4-1 


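\[ \begin{bmatrix} 1 & 2 & 3 \\ 2 & 0 & 4 \end{bmatrix} \begin{bmatrix} 1 & 6 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1(1) + 2(0) + 3(1) & 1(6) + 2(1) + 3(1) \\ 2(1) + 0(0) + 4(1) & 2(6) + 0(1) + 4(1) \end{bmatrix} \]

\[ \mathbf{AB} = \begin{bmatrix} 4 & 11 \\ 6 & 16 \end{bmatrix} \]

Example 4-2 

\[ \begin{bmatrix} 1 & 6 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 2 & 0 & 4 \end{bmatrix} = \begin{bmatrix} 1(1) + 6(2) & 1(2) + 6(0) & 1(3) + 6(4) \\ 0(1) + 1(2) & 0(2) + 1(0) & 0(3) + 1(4) \\ 1(1) + 1(2) & 1(2) + 1(0) & 1(3) + 1(4) \end{bmatrix} \]

\[ \mathbf{BA} = \begin{bmatrix} 13 & 2 & 27 \\ 2 & 0 & 4 \\ 3 & 2 & 7 \end{bmatrix} \]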


As the examples and the definition in Eq. (4-10) make clear, the order in 
which matrices are multiplied is of crucial importance: 


AB indicates that A is postmultiplied by B or, equivalently, that B is premultiplied 
by A 


This operation is only possible if the inner products of rows of A and columns of 
B exist, that is, if the number of columns in A is equal to the number of rows in 
B. In this event the matrices are said to be conformable. The simplest check is to 
write down the order of the two matrices to be multiplied, as in Eq. (4-9), and it is 
seen that the common index n disappears to give a product matrix of order 
m X p. The product BA would only exist if p = m so that the inner products of 
the rows of B and the columns of A could be formed. Note carefully that the 
definition of matrix multiplication involves the inner products of the rows of the 
first matrix and the columns of the second. 
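The rule in Eq. (4-10) and the conformability check can be sketched in Python as follows. This is an illustrative sketch only: matrices are assumed to be stored as lists of rows, `matmul` is our own name, and the two matrices are illustrative values of order 2 × 3 and 3 × 2.

```python
def matmul(A, B):
    """Product AB of an m x n matrix A and an n x p matrix B (lists of rows)."""
    n = len(B)
    if any(len(row) != n for row in A):
        # Columns of A must equal rows of B for the inner products to exist.
        raise ValueError("not conformable: columns of A must equal rows of B")
    p = len(B[0])
    # Element (i, j) is the inner product of row i of A and column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


A = [[1, 2, 3],
     [2, 0, 4]]          # order 2 x 3
B = [[1, 6],
     [0, 1],
     [1, 1]]             # order 3 x 2

print(matmul(A, B))      # a matrix of order 2 x 2
print(matmul(B, A))      # a matrix of order 3 x 3, so AB and BA differ
```

Here both AB and BA exist because p = m, yet they are not even of the same order, which underlines why the order of multiplication matters.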

A special case of Eq. (4-10) occurs when one of the matrices is simply a 
vector. For example, 


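\[ \mathbf{Ab} = \begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_m' \end{bmatrix} \mathbf{b} = \begin{bmatrix} \mathbf{a}_1'\mathbf{b} \\ \mathbf{a}_2'\mathbf{b} \\ \vdots \\ \mathbf{a}_m'\mathbf{b} \end{bmatrix} = \mathbf{c} \tag{4-11} \]

and c is an m × 1 (column) vector. 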


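Returning now to Eq. (4-2), the right-hand side incorporates the multiplication 
of a matrix by a vector, and applying the above rules gives 

\[ \mathbf{X}\boldsymbol{\beta} = \begin{bmatrix} \beta_1 + \beta_2 X_{21} + \beta_3 X_{31} + \cdots + \beta_k X_{k1} \\ \vdots \\ \beta_1 + \beta_2 X_{2n} + \beta_3 X_{3n} + \cdots + \beta_k X_{kn} \end{bmatrix} \tag{4-12} \]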


which is just an n-element column vector. 


Matrix Addition 


The right-hand side of Eq. (4-2) or Eq. (4-3) is now seen to consist of the addition 
of the two vectors $\mathbf{X}\boldsymbol{\beta}$ and $\mathbf{u}$. This addition is simply achieved by adding 
corresponding elements. Thus the operation is only defined for vectors with the 
same number of elements. In general, 

\[ \mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{bmatrix} \tag{4-13} \]


The definition in Eq. (4-13) is readily extended to the addition of matrices. Two 
matrices A and B can be added together only if they are of the same order m X n. 
The sum matrix is also of the same order, and each element in it is simply the sum 
of the corresponding elements in A and B. 

Applying Eq. (4-13), the right-hand side of Eq. (4-3) reduces to an n-element 
column vector. 


Equality of Matrices 
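Finally, Eq. (4-3) states an equality between $\mathbf{y}$ and $\mathbf{X}\boldsymbol{\beta} + \mathbf{u}$. This simply means 
that the first element in $\mathbf{y}$ is equal to the first element in $\mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, and so on through 
all n elements, that is, 
\[ \begin{aligned} Y_1 &= \beta_1 + \beta_2 X_{21} + \cdots + \beta_k X_{k1} + u_1 \\ &\;\;\vdots \\ Y_n &= \beta_1 + \beta_2 X_{2n} + \cdots + \beta_k X_{kn} + u_n \end{aligned} \]
and so we are back to the n equations of Eq. (4-1) and have shown that the simple 
rules of matrix addition and multiplication enable us to write the system (4-1) in 
the compact form 

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \]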


Further Remarks on Matrices 
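Our primary purpose is to analyze the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$ by least-squares 
techniques, but in order to proceed with that we need to develop some further 
properties of matrices. 

Transposition has already been defined for vectors. Since a matrix is a 
collection of vectors, we can define the transpose of a matrix. Let A be an m × n 
matrix, which we write in the form 

\[ \underset{(m \times n)}{\mathbf{A}} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{bmatrix} \]

to indicate the m-element column vectors. The transpose $\mathbf{A}'$ of A is defined as 

\[ \mathbf{A}' = \begin{bmatrix} \rule[0.5ex]{1.5em}{0.4pt}\ \mathbf{a}_1'\ \rule[0.5ex]{1.5em}{0.4pt} \\ \rule[0.5ex]{1.5em}{0.4pt}\ \mathbf{a}_2'\ \rule[0.5ex]{1.5em}{0.4pt} \\ \vdots \\ \rule[0.5ex]{1.5em}{0.4pt}\ \mathbf{a}_n'\ \rule[0.5ex]{1.5em}{0.4pt} \end{bmatrix} \]

that is, the first row of A has become the first column of the transpose, the second 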
row of A the second column of the transpose, and so forth. The definition might 
equally well have been stated in terms of the first column of A becoming the first 
row of A’, and so on. Clearly, A’ is of order n X m. 


Example 4-3 


it py Buganitg a 
Aelapuglog és 


A symmetric matrix is defined to be one for which 


A’=A 
that is, 
\[ a_{ij} = a_{ji} \qquad \text{for } i \neq j \]


This property can only hold for square matrices (m = n), since otherwise A’ and 
A are not even of the same order. 


Example 4-4 


A= 


aa 4 
= 0 2 ieee, 

4 3 2 
From the definition of a transpose it is immediately obvious that the following 
two properties hold: 

\[ (\mathbf{A}')' = \mathbf{A} \]
that is, the transpose of the transpose equals the original matrix, and 
\[ (\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}' \]

that is, the transpose of a sum is the sum of the transposes. Somewhat less 


obvious is the interpretation of (AB)’, the transpose of the product AB. Referring 
back to Eq. (4-10), we note again that the i, jth element in C = AB is 


\[ c_{ij} = \mathbf{a}_i'\mathbf{b}_j = \text{inner product of ith row of A into jth column of B} \qquad i = 1, \ldots, m; \; j = 1, \ldots, p \]
Transposition of C means that the i, jth element in $\mathbf{C}'$ is the j, ith element in C. 
Using $c'_{ij}$ to denote the i, jth element in $\mathbf{C}'$ gives 

\[ c'_{ij} = c_{ji} = \mathbf{a}_j'\mathbf{b}_i \]

But 
\[ \mathbf{a}_j'\mathbf{b}_i = \mathbf{b}_i'\mathbf{a}_j \]
from the definition of an inner product. Thus 
\[ c'_{ij} = \mathbf{b}_i'\mathbf{a}_j = \text{inner product of ith row of } \mathbf{B}' \text{ and jth column of } \mathbf{A}' \]


and so, from the definition of matrix multiplication, 


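\[ \mathbf{C}' = (\mathbf{AB})' = \mathbf{B}'\mathbf{A}' \tag{4-14} \]
This rule extends immediately to any number of matrices. Thus 
\[ (\mathbf{ABC})' = \mathbf{C}'\mathbf{B}'\mathbf{A}' \tag{4-15} \]

since 
\[ (\mathbf{ABC})' = \mathbf{C}'(\mathbf{AB})' = \mathbf{C}'\mathbf{B}'\mathbf{A}' \]
by repeated application of Eq. (4-14). 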
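The transposition rules can be checked numerically. The sketch below is an illustration under our own assumptions: matrices are lists of rows, the helper names `transpose` and `matmul` are ours, and the matrices are arbitrary conformable examples.

```python
def transpose(A):
    """Rows of A become the columns of the result."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Product AB; assumes A and B are conformable lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


A = [[1, 2, 3],
     [2, 0, 4]]
B = [[1, 6],
     [0, 1],
     [1, 1]]

lhs = transpose(matmul(A, B))               # (AB)'
rhs = matmul(transpose(B), transpose(A))    # B'A'
print(lhs == rhs)
print(transpose(transpose(A)) == A)         # (A')' = A
```

Note that A'B' could not even be formed as a check here in general; the reversal of order in Eq. (4-14) is what keeps the factors conformable.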
The associative law of addition holds for matrices, that is, 


\[ (\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C}) \tag{4-16} \]


This result is obvious since matrix addition merely involves adding corresponding 
elements, and it does not matter in what order the additions are performed. 
The associative law of multiplication also holds, that is, 


(AB)C = A(BC) (4-17) 
where A, B, and C are assumed to be of the appropriate order for multiplication. 
To prove this result, we will show that the i, jth elements on each side of Eq. 
(4-17) are the same. The ith row of the product AB is given by 

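\[ \begin{bmatrix} \mathbf{a}_i'\mathbf{b}_1 & \mathbf{a}_i'\mathbf{b}_2 & \cdots \end{bmatrix} = \mathbf{a}_i' \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots \end{bmatrix} = \mathbf{a}_i'\mathbf{B} \]
where $\mathbf{a}_i'$ denotes the ith row of A and $\mathbf{b}_1, \mathbf{b}_2, \ldots$ denote the columns of B. Letting 
$\mathbf{c}_j$ denote the jth column of C, the i, jth element of (AB)C is then 
\[ \mathbf{a}_i'\mathbf{B}\mathbf{c}_j \]
Similarly, the jth column of the product BC is 
\[ \mathbf{B}\mathbf{c}_j \]
and so the i, jth element of A(BC) is 
\[ \mathbf{a}_i'\mathbf{B}\mathbf{c}_j \]
which is the same expression, so the two sides of Eq. (4-17) agree element by element. 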
The distributive law also holds for matrices, that is, 
A(B + C) = AB + AC (4-18) 
To see this, let $\mathbf{a}_i'$ denote the ith row of A and $\mathbf{b}_j$ and $\mathbf{c}_j$ the jth columns of B and 
C. The i, jth element on the right-hand side of Eq. (4-18) is then the scalar 
\[ \mathbf{a}_i'\mathbf{b}_j + \mathbf{a}_i'\mathbf{c}_j \]
and by the application of the distributive law for scalar algebra this is clearly 
equal to the inner product of $\mathbf{a}_i'$ and the vector $\mathbf{b}_j + \mathbf{c}_j$, the jth column of B + C, 
which gives the i, jth element on the left-hand side of Eq. (4-18). 

If a matrix is multiplied by a scalar, then every element in the matrix is 
multiplied by the scalar. For example,† 

\[ -2 \begin{bmatrix} 1 & 2 & 3 \\ 2 & 0 & 4 \end{bmatrix} = \begin{bmatrix} -2 & -4 & -6 \\ -4 & 0 & -8 \end{bmatrix} \]


There are some square matrices of particular importance. First is the unit or 
identity matrix of order n X n, 


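\[ \mathbf{I}_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \]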


with units down the main, or principal, diagonal and zeros everywhere else. As we 
shall see, it plays in matrix algebra a role similar to that of unity in scalar algebra. 
As one illustration, 

IA=AI=A 
that is, pre- or postmultiplication by I leaves any matrix unchanged, as may 
readily be verified by multiplying out IA and AI. Thus the unit matrix may be 
entered or suppressed at will in matrix expressions. For instance, 


Yay We ey. 
7 ie EY 


A diagonal matrix is like the identity matrix in that all off-diagonal terms are 
zero, but now the diagonal elements are scalar quantities, one of which at least is 
nonzero. The diagonal matrix may be written 


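\[ \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \tag{4-19} \]
or, more compactly, $\boldsymbol{\Lambda} = \operatorname{diag}\{\lambda_1 \;\; \lambda_2 \;\; \cdots \;\; \lambda_n\}$. Examples are 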

2 0 0 

(Ss) me fo ol 

0 0 5 


A special case of Eq. (4-19) occurs when the λ's are all equal. This is termed a 
scalar matrix and may be written 

\[ \begin{bmatrix} \lambda & 0 & \cdots & 0 \\ 0 & \lambda & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda \end{bmatrix} = \lambda \mathbf{I} \]

† The scalar may be placed in front of or behind the matrix. 

Another special case of a square matrix is an idempotent matrix. Let A be a 
square symmetric matrix, so that 
\[ \mathbf{A}' = \mathbf{A} \]
If A is idempotent, then 
\[ \mathbf{A} = \mathbf{A}^2 = \mathbf{A}^3 = \cdots \]

that is, multiplying A by itself, however many times, simply reproduces the 
original matrix.‡ As we will see later, idempotent matrices play an important role 
in statistical theory. An example of an idempotent matrix is 


iL =e 1 
7) Axons 
Aoylt 32! 1 


A= 


as the reader can easily verify by multiplication. 
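The verification by multiplication is easily mechanized. The Python sketch below uses an illustrative matrix of our own choosing (an assumption, not the example printed in the text): the identity matrix minus a matrix whose every element is 1/n, which is symmetric. Exact fractions are used so that the equality test is not disturbed by rounding.

```python
from fractions import Fraction


def matmul(A, B):
    """Product AB; assumes A and B are conformable lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


n = 3
# Illustrative symmetric matrix (hypothetical example): I - (1/n) * ones.
M = [[(1 if i == j else 0) - Fraction(1, n) for j in range(n)]
     for i in range(n)]

M2 = matmul(M, M)     # M multiplied by itself
M3 = matmul(M2, M)    # and once more
print(M2 == M, M3 == M)
```

Both comparisons hold, illustrating A = A² = A³ for this particular matrix.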
Another important matrix is the null matrix 0 whose every element is zero. 


Obvious relations are 
A+0=A 


and 
A0=0 


Similarly, we may have null row or column vectors. 


Partitioned Matrices 
Writing the matrix X in the form 
\[ \mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k \end{bmatrix} \]


as in Eq. (4-4), is a special example of a partitioned matrix. The elements on the 
right-hand side are not scalars but vectors. In general a partitioned matrix 
contains submatrices as elements. The submatrices are obtained by partitions of 
the rows and columns of the original matrix. For example, 


45 Oesaieatile, HA GLeN 
aeta| 0G Sgt OT -|t He (4-20) 
-3- Soe a seen) fe 
23 aatlones 


where 


\[ \mathbf{A}_{21} = \begin{bmatrix} -3 & 2 & 0 \end{bmatrix} \qquad \mathbf{A}_{22} = 5 \tag{4-21} \]

The dashed lines indicate the partitioning, yielding the four submatrices defined 

in Eqs. (4-21). 


‡ A square nonsymmetric matrix is idempotent if it satisfies $\mathbf{A}^2 = \mathbf{A}$, but we will only meet 
symmetric idempotent matrices in this book. 

Our previous rules for the addition and multiplication of matrices apply 
directly to partitioned matrices provided the submatrices are of appropriate dimen- 
sion. For example, if A and B are both written in partitioned form as 


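\[ \mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \qquad \text{and} \qquad \mathbf{B} = \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} \]
then 
\[ \mathbf{A} + \mathbf{B} = \begin{bmatrix} \mathbf{A}_{11} + \mathbf{B}_{11} & \mathbf{A}_{12} + \mathbf{B}_{12} \\ \mathbf{A}_{21} + \mathbf{B}_{21} & \mathbf{A}_{22} + \mathbf{B}_{22} \end{bmatrix} \]

provided A and B are of the same overall order (dimension) and each pair $\mathbf{A}_{ij}$, $\mathbf{B}_{ij}$ 
is of the same order. As an example of the multiplication of partitioned matrices, 

\[ \mathbf{AB} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \\ \mathbf{A}_{31} & \mathbf{A}_{32} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \\ \mathbf{A}_{31}\mathbf{B}_{11} + \mathbf{A}_{32}\mathbf{B}_{21} & \mathbf{A}_{31}\mathbf{B}_{12} + \mathbf{A}_{32}\mathbf{B}_{22} \end{bmatrix} \]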


For the multiplication to be possible and for these equations to hold, the number 
of columns in A must equal the number of rows in B and the same partitioning 
must be applied to the columns of A as to the rows of B. 
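The requirement that the columns of A and the rows of B be partitioned alike can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the matrices are arbitrary, the helper names are ours, and B is partitioned by rows only, so that it has two blocks rather than four.

```python
def matmul(A, B):
    """Product AB; assumes A and B are conformable lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def add(A, B):
    """Element-by-element sum of two matrices of the same order."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def block(M, rows, cols):
    """Submatrix of M for the given row and column index ranges."""
    return [[M[i][j] for j in cols] for i in rows]


A = [[1, 2, 0],
     [3, 1, 4],
     [0, 5, 2]]
B = [[2, 1],
     [0, 3],
     [1, 1]]

# Partition the columns of A and the rows of B at the same place (after 2).
top, bottom = range(0, 2), range(2, 3)
left, right = range(0, 2), range(2, 3)

A11, A12 = block(A, top, left), block(A, top, right)
A21, A22 = block(A, bottom, left), block(A, bottom, right)
B1, B2 = block(B, left, range(2)), block(B, right, range(2))

# Stack the two block rows: [A11 B1 + A12 B2 ; A21 B1 + A22 B2].
blockwise = (add(matmul(A11, B1), matmul(A12, B2))
             + add(matmul(A21, B1), matmul(A22, B2)))
print(blockwise == matmul(A, B))
```

The blockwise result agrees with the ordinary product because each block product is itself conformable.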


Summary on Matrix Operations 
The main results of this section are summarized as follows: 


1. The scalar, dot, or inner product of two n-element column vectors a and b is 
$\mathbf{a}'\mathbf{b} = \mathbf{b}'\mathbf{a} = \sum_i a_i b_i$. 

2. The typical i, jth element in the product AB, where the matrices are 
conformable for multiplication, is $\sum_k a_{ik} b_{kj}$. 

3. The typical element in A + B is $a_{ij} + b_{ij}$. 

4. A = B means $a_{ij} = b_{ij}$ for all i, j. 

5. $(\mathbf{AB})' = \mathbf{B}'\mathbf{A}'$, $(\mathbf{ABC})' = \mathbf{C}'\mathbf{B}'\mathbf{A}'$. 

6. (A + B) + C = A + (B + C). 

7. (AB)C = A(BC). 

8. A(B + C) = AB + AC. 

9. IA = AI = A. 

10. The typical element in cA, where c is a scalar, is $c\,a_{ij}$. 

11. A + 0 = A. 

12. A0 = 0A = 0. 


4-2 MATRIX FORMULATION OF THE LEAST-SQUARES PROBLEM 


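Returning again to the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, this may be written in the form 

\[ \mathbf{y} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \mathbf{u} \]
that is, 
\[ \mathbf{y} = \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \cdots + \beta_k \mathbf{x}_k + \mathbf{u} \tag{4-22} \]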


Equation (4-22) states that the observed y vector is the sum of the disturbance 
vector u and a linear combination of the columns of X. If we replace the unknown 
β's in Eq. (4-22) by guesses or estimates denoted by $b_1, b_2, \ldots, b_k$, then for the 
tth observation we have an observed value $Y_t$ and a calculated value 

\[ b_1 X_{1t} + b_2 X_{2t} + \cdots + b_k X_{kt} \]

where $X_{1t} = 1$ for all t, since the first column of X is a column of units. 
The difference between these two values defines a residual or error term, 

\[ e_t = Y_t - b_1 X_{1t} - \cdots - b_k X_{kt} \]
Repeating the procedure for all sample points gives 
\[ \mathbf{y} = b_1 \mathbf{x}_1 + b_2 \mathbf{x}_2 + \cdots + b_k \mathbf{x}_k + \mathbf{e} = \mathbf{Xb} + \mathbf{e} \tag{4-23} \]


where b is a k-element coefficient vector and e is an n-element vector of residuals. 
The least-squares principle states that the b’s in Eq. (4-23) should be chosen 
to minimize the sum of the squared residuals. This sum of squared residuals is 


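\[ \mathbf{e}'\mathbf{e} = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{t=1}^{n} e_t^2 \]

From Eq. (4-23), 
\[ \mathbf{e} = \mathbf{y} - \mathbf{Xb} \]
Hence 
\[ \begin{aligned} \mathbf{e}'\mathbf{e} &= (\mathbf{y} - \mathbf{Xb})'(\mathbf{y} - \mathbf{Xb}) \\ &= (\mathbf{y}' - \mathbf{b}'\mathbf{X}')(\mathbf{y} - \mathbf{Xb}) \\ &= \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} + \mathbf{b}'\mathbf{X}'\mathbf{Xb} \\ &= \mathbf{y}'\mathbf{y} - 2\mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{Xb} \end{aligned} \tag{4-24} \]
since $\mathbf{y}'\mathbf{Xb}$ is a scalar and so is equal to its transpose $\mathbf{b}'\mathbf{X}'\mathbf{y}$. Once the sample data 
have been obtained, y and X consist of known numbers. Thus Eq. (4-24) expresses 
$\mathbf{e}'\mathbf{e}$ as a function of the unknown b vector, 
\[ \mathbf{e}'\mathbf{e} = f(\mathbf{b}) \tag{4-25} \]
and, treating the elements of b as variables, we have to minimize $\mathbf{e}'\mathbf{e}$ with respect 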
to b. This requires some elementary results on matrix differentiation. 


Matrix Differentiation 


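If f(b) contains, say, k different b's, then we may partially differentiate f(b) with 
respect to each $b_i$ in turn, obtaining k partial derivatives. Arranging these partial 
derivatives in the form of a column vector gives the general definition 

\[ \frac{\partial f(\mathbf{b})}{\partial \mathbf{b}} = \begin{bmatrix} \dfrac{\partial f(\mathbf{b})}{\partial b_1} \\ \dfrac{\partial f(\mathbf{b})}{\partial b_2} \\ \vdots \\ \dfrac{\partial f(\mathbf{b})}{\partial b_k} \end{bmatrix} \tag{4-26} \]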
These derivatives might equally well have been arranged as a row vector. The 
important requirement is consistency of treatment and ensuring that vectors and 


matrices of derivatives that have to be added and multiplied are of appropriate 
order. 


Suppose f(b) is a linear function, 
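\[ f(\mathbf{b}) = \mathbf{a}'\mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_k b_k \]
where the a's are given constants. Application of Eq. (4-26) then gives 

\[ \frac{\partial (\mathbf{a}'\mathbf{b})}{\partial \mathbf{b}} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{bmatrix} = \mathbf{a} \tag{4-27} \]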
Suppose now that f(b) is quadratic in the b’s, that is, 
f(b) = b’Ab 


This is a quadratic form in b, and we will develop the properties of quadratic 
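forms later. Without any loss of generality we can suppose A to be a symmetric 
matrix† denoted by 

\[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{12} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & & \vdots \\ a_{1k} & a_{2k} & \cdots & a_{kk} \end{bmatrix} \]

† If A were not symmetric, define $\mathbf{A}^* = (\mathbf{A} + \mathbf{A}')/2$. Then $\mathbf{b}'\mathbf{Ab} = \mathbf{b}'\mathbf{Ab}/2 + \mathbf{b}'\mathbf{Ab}/2 = \mathbf{b}'\mathbf{Ab}/2 + \mathbf{b}'\mathbf{A}'\mathbf{b}/2$, 
on transposing the second $\mathbf{b}'\mathbf{Ab}$. Thus 
\[ \mathbf{b}'\mathbf{Ab} = \mathbf{b}' \left( \frac{\mathbf{A} + \mathbf{A}'}{2} \right) \mathbf{b} = \mathbf{b}'\mathbf{A}^*\mathbf{b} \]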


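Then 
\[ \begin{aligned} \mathbf{b}'\mathbf{Ab} = {} & a_{11} b_1^2 + 2a_{12} b_1 b_2 + 2a_{13} b_1 b_3 + \cdots + 2a_{1k} b_1 b_k \\ & + a_{22} b_2^2 + 2a_{23} b_2 b_3 + \cdots + 2a_{2k} b_2 b_k \\ & + \cdots \\ & + a_{kk} b_k^2 \end{aligned} \]

Taking partial derivatives, 

\[ \begin{aligned} \frac{\partial (\mathbf{b}'\mathbf{Ab})}{\partial b_1} &= 2(a_{11} b_1 + a_{12} b_2 + \cdots + a_{1k} b_k) = 2\mathbf{a}_1'\mathbf{b} \\ &\;\;\vdots \\ \frac{\partial (\mathbf{b}'\mathbf{Ab})}{\partial b_k} &= 2(a_{1k} b_1 + a_{2k} b_2 + \cdots + a_{kk} b_k) = 2\mathbf{a}_k'\mathbf{b} \end{aligned} \]

where the $\mathbf{a}_i'$ indicate the rows of A. Collecting these partial derivatives in a 
column vector, 

\[ \frac{\partial (\mathbf{b}'\mathbf{Ab})}{\partial \mathbf{b}} = 2 \begin{bmatrix} \mathbf{a}_1'\mathbf{b} \\ \mathbf{a}_2'\mathbf{b} \\ \vdots \\ \mathbf{a}_k'\mathbf{b} \end{bmatrix} = 2 \begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_k' \end{bmatrix} \mathbf{b} = 2\mathbf{Ab} \tag{4-28} \]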
Equations (4-27) and (4-28) give the standard results on the differentiation of 
linear and quadratic forms. Notice the parallel with the differentiation of scalar 
functions in that the power of the variable is reduced by 1 so that it disappears on 
differentiation of the linear form and appears linearly on differentiation of the 
quadratic form. 
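The two differentiation results can be checked numerically by comparing them with finite-difference approximations. The Python sketch below is illustrative only: the matrix A, the vectors a and b, the step size h, and the helper names are all our own assumptions.

```python
def quad(A, b):
    """Quadratic form b'Ab."""
    return sum(b[i] * A[i][j] * b[j]
               for i in range(len(b)) for j in range(len(b)))


def grad_numeric(f, b, h=1e-6):
    """Central-difference approximation to the vector of partials of f at b."""
    g = []
    for i in range(len(b)):
        up = b[:i] + [b[i] + h] + b[i + 1:]
        dn = b[:i] + [b[i] - h] + b[i + 1:]
        g.append((f(up) - f(dn)) / (2 * h))
    return g


A = [[2.0, 1.0],
     [1.0, 3.0]]          # symmetric, illustrative values
a = [4.0, -1.0]
b = [0.5, 2.0]

# Linear form a'b: the vector of partials should be a itself, Eq. (4-27).
g_lin = grad_numeric(lambda v: sum(x * y for x, y in zip(a, v)), b)
# Quadratic form b'Ab: the vector of partials should be 2Ab, Eq. (4-28).
g_quad = grad_numeric(lambda v: quad(A, v), b)
two_Ab = [2 * sum(A[i][j] * b[j] for j in range(2)) for i in range(2)]

print(g_lin, a)
print(g_quad, two_Ab)
```

The printed pairs agree up to rounding error in the finite differences.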
These two results may now be applied directly to minimize the residual sum 


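of squares defined in Eq. (4-24). 

\[ \frac{\partial (\mathbf{b}'\mathbf{X}'\mathbf{y})}{\partial \mathbf{b}} = \mathbf{X}'\mathbf{y} \]
using Eq. (4-27), since $\mathbf{X}'\mathbf{y}$ is just a known k-element vector. Also 
\[ \frac{\partial (\mathbf{b}'\mathbf{X}'\mathbf{Xb})}{\partial \mathbf{b}} = 2(\mathbf{X}'\mathbf{X})\mathbf{b} \]
using Eq. (4-28). Thus 
\[ \frac{\partial (\mathbf{e}'\mathbf{e})}{\partial \mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} \]
For a stationary value of the sum of squares all k partial derivatives must be zero, 
that is, 
\[ \frac{\partial (\mathbf{e}'\mathbf{e})}{\partial \mathbf{b}} = \mathbf{0} \]
and so 

\[ (\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{y} \tag{4-29} \]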
These are the normal equations for the least-squares regression, and include the 
equations for the two- and three-variable cases already derived in Chaps. 2 and 3. 


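Example 4-5 Two-variable regression For the two-variable regression the X 
matrix is 
\[ \mathbf{X} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \]
Thus 
\[ \mathbf{X}'\mathbf{X} = \begin{bmatrix} n & \sum X \\ \sum X & \sum X^2 \end{bmatrix} \qquad \text{and} \qquad \mathbf{X}'\mathbf{y} = \begin{bmatrix} \sum Y \\ \sum XY \end{bmatrix} \]

So Eq. (4-29) gives 
\[ \begin{aligned} n b_1 + b_2 \sum X &= \sum Y \\ b_1 \sum X + b_2 \sum X^2 &= \sum XY \end{aligned} \]
which are identical with Eqs. (2-13). 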
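For a concrete feel for these two equations, the Python sketch below forms the elements of X'X and X'y from a small data set and solves the pair of equations in closed form. The data are invented for illustration and the variable names are ours; exact fractions are used so that the check of the normal equations is not affected by rounding.

```python
from fractions import Fraction

# Illustrative data (assumed for this sketch, not taken from the text).
X = [1, 2, 3, 4, 5]
Y = [3, 5, 4, 8, 10]
n = len(X)

# Elements of X'X and X'y for the two-variable case.
sx = sum(X)
sxx = sum(x * x for x in X)
sy = sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))

# Closed-form solution of  n*b1 + sx*b2 = sy  and  sx*b1 + sxx*b2 = sxy.
det = n * sxx - sx * sx
b1 = Fraction(sy * sxx - sx * sxy, det)
b2 = Fraction(n * sxy - sx * sy, det)
print(b1, b2)

# Both normal equations are satisfied exactly.
print(n * b1 + sx * b2 == sy, sx * b1 + sxx * b2 == sxy)
```

The quantity `det` is zero only when all the X values are identical, in which case the two equations do not determine b1 and b2 separately.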


Example 4-6 Three-variable regression In this case 
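\[ \mathbf{X}'\mathbf{X} = \begin{bmatrix} n & \sum X_2 & \sum X_3 \\ \sum X_2 & \sum X_2^2 & \sum X_2 X_3 \\ \sum X_3 & \sum X_2 X_3 & \sum X_3^2 \end{bmatrix} \]
and 
\[ \mathbf{X}'\mathbf{y} = \begin{bmatrix} \sum Y \\ \sum X_2 Y \\ \sum X_3 Y \end{bmatrix} \]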


which, on substitution in Eq. (4-29), yield Eq. (3-62). 


4-3 GEOMETRIC INTERPRETATION OF LEAST SQUARES 
In many problems it is often helpful to have a geometric as well as an algebraic 


interpretation. We will thus introduce some basic notions on the geometry of 


vectors. 
Consider a two-element vector 
\[ \mathbf{a} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \]

Figure 4-1 (axes: 1st component, 2d component) 


This may be pictured as a directed line segment, as shown in Fig. 4-1. The arrow 
denoting the segment starts at the origin and ends at the point with coordinates 
(2,1). The vector a may also be indicated by the point at which the arrow 
terminates. If we have another vector, say, b, 

\[ \mathbf{b} = \begin{bmatrix} 1 \\ 3 \end{bmatrix} \]

the geometry of vector addition is conceived as follows. Start with a and then 
place the b vector at the terminal point of the a vector. This takes us to the point 
P in Fig. 4-1. This point defines the vector c as the sum of vectors a and b, and it 
is obviously also reached by starting with the b vector and placing the a vector at 
its terminal point. The process is referred to as completing the parallelogram, or 
as the parallelogram law for the addition of vectors. Clearly, the coordinates of P 
are (3,4), and 
\[ \mathbf{c} = \mathbf{a} + \mathbf{b} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \]


so that there is an exact correspondence between the geometric and the algebraic 


treatments. 
Now consider scalar multiplication of a vector. For example, 


\[ 2\mathbf{a} = 2 \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix} \]

gives a vector in exactly the same direction as a, but of twice the length. The 
scalar multiplier may also be a negative number. For example, 

\[ -3\mathbf{a} = \begin{bmatrix} -6 \\ -3 \end{bmatrix} \]
These two vectors are shown along with a itself in Fig. 4-2. Clearly, all three 
terminal points lie on a single line through the origin, that line being uniquely 
defined by the vector a. 

Figure 4-2 (axes: 1st component, 2d component) 


Combining the two operations of scalar multiplication and addition of 
vectors enables us to represent any two-element vector as a linear combination of 
the vectors a and b. If c denotes any arbitrary vector and it is to be represented as 
a linear combination of a and b, then we may write 

\[ \mathbf{c} = \lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} \tag{4-30} \]

where $\lambda_1$ and $\lambda_2$ are appropriate scalars. As an illustration of Eq. (4-30) consider 
the following examples: 


el 


may be expressed as 


2 1] _ [2+] [4 
stil “| A hi is | os [*| 
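2. $\mathbf{c} = \begin{bmatrix} 6 \\ 3 \end{bmatrix}$ 
may be expressed as 

\[ 3 \begin{bmatrix} 2 \\ 1 \end{bmatrix} + 0 \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 6 \\ 3 \end{bmatrix} \]

giving $\lambda_1 = 3$ and $\lambda_2 = 0$. 

3. $\mathbf{c} = \begin{bmatrix} 6 \\ 8 \end{bmatrix}$ 
may be expressed as 

\[ 2 \begin{bmatrix} 2 \\ 1 \end{bmatrix} + 2 \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 6 \\ 8 \end{bmatrix} \]

giving $\lambda_1 = 2$ and $\lambda_2 = 2$. 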
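Finding the scalars in Eq. (4-30) amounts to solving two linear equations in two unknowns. The Python sketch below does this in closed form, taking a = (2, 1) and b = (1, 3), the values that reproduce the λ's quoted in examples 2 and 3. The function name `coordinates` is our own, and the closed-form solution is a standard device that the text itself does not spell out here.

```python
from fractions import Fraction


def coordinates(a, b, c):
    """Return (l1, l2) such that c = l1*a + l2*b for two-element vectors."""
    det = a[0] * b[1] - a[1] * b[0]
    if det == 0:
        # a and b lie on one line through the origin and cannot reach every c.
        raise ValueError("a and b point along the same line")
    l1 = Fraction(c[0] * b[1] - c[1] * b[0], det)
    l2 = Fraction(a[0] * c[1] - a[1] * c[0], det)
    return l1, l2


a, b = (2, 1), (1, 3)
print(coordinates(a, b, (6, 3)))   # c is a multiple of a alone
print(coordinates(a, b, (6, 8)))
```

The error branch anticipates the case, discussed in the text that follows, in which the two vectors point in the same direction.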
We are now in a position to define a vector space. A vector space is a 
collection of vectors with the following properties: 


1. If $\mathbf{v}_1$ and $\mathbf{v}_2$ are any two vectors in the space, then $\mathbf{v}_1 + \mathbf{v}_2$ is in the space. 
2. If v is in the space and A is a scalar constant, then Av is in the space. 


The set of vectors is said to be closed under addition and scalar multiplication, for 
these operations do not produce a vector outside the space. 

Let us denote the two-dimensional space by the symbol $R^2$. This vector space 
consists of all real two-element vectors. Clearly, any vector in the space can be 
expressed as a linear combination of the two vectors a and b. Our specification of 
a and b, however, was arbitrary. Consider another pair of vectors 

\[ \mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \qquad \text{and} \qquad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \]

These are usually described as unit vectors, and again any vector c in $R^2$ may be 
expressed as a linear combination of these vectors, only now the determination of 
the λ's is particularly simple. The three previous numerical examples in this case 


give 


a ale 0 = = 
1. [4] =4[2] +02] so A, = 4,42 = 0 

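2. $\begin{bmatrix} 6 \\ 3 \end{bmatrix} = 6 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3 \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ so $\lambda_1 = 6$, $\lambda_2 = 3$ 

3. $\begin{bmatrix} 6 \\ 8 \end{bmatrix} = 6 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 8 \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ so $\lambda_1 = 6$, $\lambda_2 = 8$ 


so that the λ's are read off directly as the elements of the c vector. 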

Each pair of vectors in these examples, that is, a, b and $\mathbf{e}_1$, $\mathbf{e}_2$, serves as a 
basis for the two-dimensional space $R^2$. A basis is thus not unique. It is clear 
from the geometry that any two vectors can serve as a basis for $R^2$ only if they 
point in different directions. If a and b point in the same direction, then one is 
simply a scalar multiple of the other, as in Fig. 4-2, and only further multiples of 
that vector can be expressed as linear combinations of a and b. The condition that 
the basis vectors point in different directions may also be expressed by stating 
that the vectors should be linearly independent. Two vectors a and b are said to be 
linearly independent if the only solution to 
\[ \lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} = \mathbf{0} \tag{4-31} \]
is $\lambda_1 = 0 = \lambda_2$. If λ values can be found, at least one of which is nonzero, to 
satisfy Eq. (4-31), then the vectors are said to be linearly dependent. 
As an illustration of these definitions, suppose 


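\[ \mathbf{a} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \qquad \text{and} \qquad \mathbf{b} = \begin{bmatrix} 6 \\ 3 \end{bmatrix} \]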


These vectors lie on the same ray through the origin, and the linear combination 
3a − b yields the zero vector. However, if we revert to the original a and b 
vectors, namely, 
\[ \mathbf{a} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \qquad \text{and} \qquad \mathbf{b} = \begin{bmatrix} 1 \\ 3 \end{bmatrix} \]

it is impossible to find a pair of λ values, other than two zeros, such that 

\[ \lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} = \mathbf{0} \]
We can easily find a pair of λ values to reduce the first element to zero, but the 
same combination will not reduce the second element to zero. A basis for $R^2$ is 
thus defined to be any linearly independent pair of two-element vectors. It is clear 
from the geometry of the two-dimensional case that the representation of a given 
vector in terms of a given basis is unique, that is, there is one and only one pair of 
$\lambda_1$, $\lambda_2$ values which satisfy 

\[ \mathbf{c} = \lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} \]
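For two-element vectors, linear independence can be tested with a single arithmetic check. The Python sketch below is an illustration under our own assumptions: the function name `independent` is ours, and the cross-difference criterion it uses is a standard shortcut for the two-element case that the text does not introduce at this point.

```python
def independent(a, b):
    """True if the two-element vectors a and b are linearly independent."""
    # l1*a + l2*b = 0 has only the zero solution exactly when this
    # cross-difference of the elements is nonzero.
    return a[0] * b[1] - a[1] * b[0] != 0


print(independent((2, 1), (1, 3)))   # the vectors point in different directions
print(independent((2, 1), (6, 3)))   # the second vector is 3 times the first
```

A pair that passes this test can serve as a basis for the two-dimensional space; a pair that fails it spans only a single line through the origin.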
Given a basis a, b for $R^2$, we have seen that any vector c in $R^2$ may be 
expressed as a unique linear combination of the basis vectors. Thus the vectors a, 
b, and c are linearly dependent, for the equation 

\[ \lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} - \mathbf{c} = \mathbf{0} \]
holds for nonzero λ's. We might ask whether any arbitrary vector v in $R^2$ may be 
expressed in terms of the expanded set of vectors a, b, and c. The answer is, of 
course, yes, but the coefficients will not be unique. For example, suppose that the 
a, b, and c vectors are 


fl Bl Ly 


and we wish to express 
\[ \mathbf{v} = \begin{bmatrix} 6 \\ 8 \end{bmatrix} \]

as a linear combination of a, b, and c. One such combination is simply 
\[ \mathbf{v} = 2\mathbf{a} + 2\mathbf{b} + 0\mathbf{c} \]
but there are infinitely many others. Rewriting the general linear combination 
\[ \mathbf{v} = \lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} + \lambda_3 \mathbf{c} \]
in the form 
\[ \mathbf{v} - \lambda_3 \mathbf{c} = \lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} \]
any arbitrary value can be assigned to $\lambda_3$, and the left-hand side is then some 
specific two-element vector, which can be expressed as a linear combination of a 
and b. A set of vectors such as a, b, and c is called a spanning set since they span 
or generate the space $R^2$, that is, any vector in $R^2$ can be expressed as a linear 
combination of the spanning vectors. The distinction between a basis and a 
spanning set is that the basis consists of linearly independent vectors. A spanning 
set may be unnecessarily large, as in the case of the set a, b, and c. This spanning 
set can be reduced to a basis by dropping one vector. 

Since linearly independent vectors point in different directions, there is then a 
nonzero angle between the vectors. This angle may be expressed in terms of the 
elements of the vectors. 

Reverting to the two-dimensional a,b vectors in Fig. 4-3, let A denote the 
angle between the a vector and the horizontal axis and let B denote the angle 
between the b vector and the horizontal axis. The angle between the two vectors is 


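\[ \theta = B - A \]
An elementary result in trigonometry states 
\[ \cos(B - A) = \cos B \cos A + \sin B \sin A \tag{4-32} \]

The length or norm of the vector a is, by Pythagoras's theorem, $\sqrt{a_1^2 + a_2^2}$. The 
length is often denoted by the symbol $\|\mathbf{a}\|$, and using the definition of the inner 
product in Eq. (4-8), we have 

\[ \|\mathbf{a}\|^2 = \mathbf{a}'\mathbf{a} \]
Likewise, 

\[ \|\mathbf{b}\|^2 = \mathbf{b}'\mathbf{b} \]
Since $\cos A = a_1/\|\mathbf{a}\|$, $\sin A = a_2/\|\mathbf{a}\|$, $\cos B = b_1/\|\mathbf{b}\|$, and $\sin B = b_2/\|\mathbf{b}\|$, 
substituting in Eq. (4-32) gives 

\[ \cos\theta = \frac{a_1 b_1}{\sqrt{\mathbf{a}'\mathbf{a}}\,\sqrt{\mathbf{b}'\mathbf{b}}} + \frac{a_2 b_2}{\sqrt{\mathbf{a}'\mathbf{a}}\,\sqrt{\mathbf{b}'\mathbf{b}}} \]

that is, 
\[ \cos\theta = \frac{\mathbf{a}'\mathbf{b}}{\sqrt{\mathbf{a}'\mathbf{a}}\,\sqrt{\mathbf{b}'\mathbf{b}}} \tag{4-33} \]
where it is understood that we take the positive square roots to indicate length. 

Figure 4-3 (axes: 1st component, 2d component) 
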
There are two important special cases of Eq. (4-33). When a and b are 
linearly dependent, we may write 
\[ \mathbf{b} = \lambda\mathbf{a} \]
where λ is some appropriate scalar. The right-hand side of Eq. (4-33) then reduces 
to unity when λ is positive, giving θ = 0° (and to −1 when λ is negative, giving 
θ = 180°). When a and b are at right angles to each other, θ = 90° 
and cos θ = 0, giving $\mathbf{a}'\mathbf{b} = 0$. Conversely, when $\mathbf{a}'\mathbf{b} = 0$, θ = 90°. Two vectors at 
right angles are said to be orthogonal. Thus two vectors are orthogonal if and only 
if 
\[ \mathbf{a}'\mathbf{b} = 0 \]
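Eq. (4-33) is simple to evaluate numerically. The Python sketch below is a minimal illustration; the function name `cos_angle` and the particular vectors are our own choices, and results are subject to ordinary floating-point rounding.

```python
import math


def cos_angle(a, b):
    """Cosine of the angle between a and b, following Eq. (4-33)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))   # positive square root of a'a
    norm_b = math.sqrt(sum(y * y for y in b))   # positive square root of b'b
    return dot / (norm_a * norm_b)


a, b = [2, 1], [1, 3]
theta = math.degrees(math.acos(cos_angle(a, b)))
print(round(theta, 1))                       # angle between a and b in degrees

print(cos_angle([2, 1], [4, 2]))             # b = 2a: cosine is (about) 1
print(cos_angle([2, 1], [-1, 2]))            # a'b = 0: orthogonal vectors
```

The same function works unchanged for vectors with more than two elements, provided neither vector is the null vector.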


Extensions to Three and Higher Dimensions 


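If we now consider real three-element vectors, then each vector corresponds to a 
point in the three-dimensional space $R^3$. Any vector v in $R^3$ may then be 
expressed as a unique linear combination of an appropriate set of three linearly 
independent vectors, which constitute a basis for $R^3$. For instance, choosing 

\[ \mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \qquad \mathbf{e}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \]
as basis vectors, a vector $\mathbf{v}' = \begin{bmatrix} 3 & -2 & 5 \end{bmatrix}$ may be written 

\[ \mathbf{v} = 3\mathbf{e}_1 - 2\mathbf{e}_2 + 5\mathbf{e}_3 \]

If we take just two of these vectors, say, $\mathbf{e}_1$ and $\mathbf{e}_2$, then all linear combinations of 
$\mathbf{e}_1$ and $\mathbf{e}_2$ constitute a vector subspace in $R^3$, namely, the horizontal plane, since 
the third component in each spanning vector is zero. More generally, any two 
three-element vectors, say, 

\[ \mathbf{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \qquad \text{and} \qquad \mathbf{b} = \begin{bmatrix} 5 \\ 1 \\ 1 \end{bmatrix} \]

span or generate a plane surface, as indicated in Fig. 4-4, by the plane containing 
Oab. 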

Figure 4-4 (axes: 1st component, 2d component, 3rd component) 

The set of all real n-element vectors constitutes the space $R^n$. Each vector in 
$R^n$ may be expressed as a unique linear combination of some appropriate set of n 
linearly independent vectors. To see that the linear combination must be unique, 
suppose that a vector v can be expressed as two different linear combinations of 
the basis vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$, namely, 

\[ \mathbf{v} = \lambda_1 \mathbf{v}_1 + \lambda_2 \mathbf{v}_2 + \cdots + \lambda_n \mathbf{v}_n \]

and 
\[ \mathbf{v} = \mu_1 \mathbf{v}_1 + \mu_2 \mathbf{v}_2 + \cdots + \mu_n \mathbf{v}_n \]



Subtracting one equation from the other gives 
\[ \mathbf{0} = (\lambda_1 - \mu_1)\mathbf{v}_1 + (\lambda_2 - \mu_2)\mathbf{v}_2 + \cdots + (\lambda_n - \mu_n)\mathbf{v}_n \]
But the basis vectors are linearly independent and so 
\[ \lambda_1 - \mu_1 = \lambda_2 - \mu_2 = \cdots = \lambda_n - \mu_n = 0 \]

and the representation is unique. If we take a set of k (< n) linearly independent 
n-element vectors, these generate a subspace of $R^n$, which is termed a hyperplane. 
The dimension of this subspace is the number of linearly independent vectors 
spanning the subspace. The parallelogram law of addition and the cosine law of 
Eq. (4-33) apply to the general case of n-element vectors. We can thus conceive of 
a set of mutually orthogonal vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ if 

\[ \mathbf{v}_i'\mathbf{v}_j = 0 \qquad \text{for all } i, j, \; i \neq j \]


We are now in a position to complete the geometric treatment of the 
least-squares problem. The matrix 


\[ \mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k \end{bmatrix} \]

consists of k n-element column vectors, where, by assumption, we have more 
observations than variables, so that n > k. The columns of X span a subspace in 
$R^n$. The dimension of the subspace cannot exceed k and will only be equal to k if 
the columns of X are linearly independent. We refer to this subspace as the 
column space of X. It is highly unlikely that the y vector lies in the column space 
of X. If it did, y could then be expressed exactly as a linear combination of the x 
vectors, giving zero residuals at all sample observations. The general case is 
depicted in Fig. 4-5, where y lies outside the column space of X. 

Figure 4-5 


Assigning arbitrary b’s to the x vectors gives 


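\[ \mathbf{Xb} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix} = b_1 \mathbf{x}_1 + b_2 \mathbf{x}_2 + \cdots + b_k \mathbf{x}_k \]
which is then a vector that lies in the column space of X. Choosing different b 
vectors gives, in turn, different Xb vectors. To each such Xb vector there 
corresponds a vector of residuals e, so that the equation 
\[ \mathbf{y} = \mathbf{Xb} + \mathbf{e} \]

as Fig. 4-5 shows, gives y as the sum of two vectors, of which one, Xb, lies in the 
space spanned by the columns of X and the other, e, lies outside that column 
space. 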

We wish to choose the b vector so as to make the point given by the tip of the 
Xb vector as close as possible to the tip of the y vector or, in other words, to 
minimize the length of the e vector. This is achieved by making the e vector 
perpendicular to the hyperplane generated by the columns of X. Thus e must be 
orthogonal to any linear combination of the columns of X. We have 


\[ \mathbf{e} = \mathbf{y} - \mathbf{Xb} \]

and Xc is any arbitrary linear combination of the columns of X. Thus the 
orthogonality condition gives 

\[ \mathbf{c}'\mathbf{X}'(\mathbf{y} - \mathbf{Xb}) = \mathbf{c}'(\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{Xb}) = 0 \tag{4-34} \]
Since c is any arbitrary nonnull vector, this condition gives 
\[ \mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{Xb} = \mathbf{0} \]
or 
\[ (\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{y} \tag{4-35} \]
which are the least-squares normal equations derived algebraically in Eq. (4-29). 
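The orthogonality of the least-squares residual vector to the column space of X can be confirmed on a small example. The Python sketch below uses invented data with a column of units and a single regressor; the helper name `dot`, the weights in `c`, and the closed-form solution of the two normal equations are our own choices for illustration.

```python
from fractions import Fraction

# Illustrative data (assumed): a column of units and one regressor.
x1 = [1, 1, 1, 1, 1]
x2 = [1, 2, 3, 4, 5]
y = [3, 5, 4, 8, 10]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Solve the 2 x 2 normal equations (X'X)b = X'y in closed form.
det = dot(x1, x1) * dot(x2, x2) - dot(x1, x2) ** 2
b1 = Fraction(dot(x1, y) * dot(x2, x2) - dot(x1, x2) * dot(x2, y), det)
b2 = Fraction(dot(x1, x1) * dot(x2, y) - dot(x1, x2) * dot(x1, y), det)

# Residual vector e = y - Xb.
e = [yt - b1 * u - b2 * v for yt, u, v in zip(y, x1, x2)]

# e is orthogonal to each column of X, hence to any combination Xc.
c = [7, -2]                                   # arbitrary weights
Xc = [c[0] * u + c[1] * v for u, v in zip(x1, x2)]
print(dot(x1, e), dot(x2, e), dot(Xc, e))
```

All three inner products are exactly zero, although e itself is not the null vector: y does not lie in the column space of X for these data.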


Summary on Vector Geometry 


1. A vector space is a collection of vectors with the following properties: 

a. If $\mathbf{v}_1$ and $\mathbf{v}_2$ are any two vectors in the space, then $\mathbf{v}_1 + \mathbf{v}_2$ is in the space. 
b. If v is in the space and λ is a scalar, then λv is in the space. 

2. If the only solution to $\lambda_1 \mathbf{a} + \lambda_2 \mathbf{b} = \mathbf{0}$ is $\lambda_1 = 0 = \lambda_2$, then a and b are 
linearly independent vectors. Otherwise they are linearly dependent. In general 
if the only solution to $\lambda_1 \mathbf{x}_1 + \lambda_2 \mathbf{x}_2 + \cdots + \lambda_n \mathbf{x}_n = \mathbf{0}$ is $\lambda_1 = \lambda_2 = \cdots = \lambda_n = 0$, 
the n vectors are said to be linearly independent. If the equation can be 
satisfied with at least one λ nonzero, they are linearly dependent. 

3. A basis for $R^2$ is any linearly independent pair of two-element vectors. 
Likewise, a basis for $R^3$ is any three linearly independent three-element 
vectors. 

4. Each vector in a space may be expressed as a unique linear combination of a 
set of basis vectors, and the minimum number in such a set is the dimension 
of the space. 

5. The angle θ between vectors a and b is defined by 

\[ \cos\theta = \frac{\mathbf{a}'\mathbf{b}}{\sqrt{\mathbf{a}'\mathbf{a}}\,\sqrt{\mathbf{b}'\mathbf{b}}} \]

6. Two vectors are orthogonal when $\mathbf{a}'\mathbf{b} = 0$ (θ = 90°). 


4-4 SOLUTION OF SETS OF EQUATIONS 


The next problem is how to solve Eq. (4-35) for the desired least-squares 
coefficients b. From the original definitions the dimensions of Eq. (4-35) are as 
follows: X'X is a square matrix of order k × k, and b and X'y are both k-element 
vectors. Thus Eq. (4-35) expresses the X'y vector as a linear combination of the 
columns of X'X, and b indicates the coefficients of that linear combination. If the 
columns of X'X are linearly independent, they constitute a basis for $R^k$, and any 
k-element vector, such as X'y, may then be expressed uniquely in terms of the 
basis vectors. In other words, Eq. (4-35) has a unique solution for the b vector. 

The solution of Eq. (4-35) for b may be expressed in terms of an inverse 
matrix. The meaning of an inverse matrix may be developed as follows. Let A be 
a square matrix of order n and let the n columns of A form a linearly independent 
set. Does a square matrix B of order n exist such that 

$$AB = I\,? \tag{4-36}$$


† A vector such as e, which is orthogonal to every vector on the hyperplane generated by the 
columns of X, is said to be normal to the hyperplane; hence the term normal equations. 
The answer is yes. Letting $b_1$ denote the first column in B and equating first 
columns on both sides of Eq. (4-36) gives the vector equation 

$$Ab_1 = e_1 \tag{4-37}$$

where $e_1 = [1 \;\; 0 \;\; 0 \;\; \cdots \;\; 0]'$. Since the columns of A are linearly independent, 
the vector $e_1$ can be expressed as a unique linear combination of those columns. 
Thus $b_1$ is uniquely determined. By a similar argument each column of B is 
uniquely determined, and so there is a matrix B satisfying Eq. (4-36). 
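The column-by-column construction of B can be sketched directly. The example below is only an illustration under stated assumptions: the 2 × 2 matrix A is hypothetical, and `solve` and `inverse_by_columns` are helper names introduced here. Each column of B is obtained by solving one system of the form of Eq. (4-37) with the corresponding unit vector on the right-hand side.

```python
from fractions import Fraction


def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination (a is square and nonsingular)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_by_columns(a):
    """Build B one column at a time from A b_i = e_i, i = 1, ..., n."""
    n = len(a)
    columns = []
    for i in range(n):
        e_i = [1 if r == i else 0 for r in range(n)]  # i-th unit vector
        columns.append(solve(a, e_i))                 # i-th column of B
    # Reassemble the list of columns into a row-major matrix.
    return [[columns[j][i] for j in range(n)] for i in range(n)]


A = [[2, 1], [5, 3]]          # hypothetical matrix with independent columns
B = inverse_by_columns(A)
identity = [[1, 0], [0, 1]]
print(B, matmul(A, B) == identity, matmul(B, A) == identity)
```

The final line also checks BA = I, which anticipates the argument that the left and right inverses coincide.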

We shall see later in this section that if the n columns of A are linearly 
independent, then so are the n rows. Then by a similar argument, a square matrix 
C of order n can be found such that 


$$CA = I \tag{4-38}$$

for each row of C is uniquely determined as the coefficients of a linear combination 
of the rows of A. Thus Eqs. (4-36) and (4-38) are both true. Postmultiplying 
Eq. (4-38) by B gives 

$$CAB = IB = B$$

But 

$$CAB = CI = C$$

using Eq. (4-36). Thus 

$$C = B$$

Thus if the n columns (and rows) of A are linearly independent, a unique square 
matrix of order n exists, called the inverse of A, and denoted by $A^{-1}$, such that 

$$AA^{-1} = A^{-1}A = I \tag{4-39}$$
If we assume that the k columns of X'X are linearly independent, then the 
inverse matrix $(X'X)^{-1}$ exists. Premultiplying both sides of Eq. (4-35) by this 
inverse gives 

$$b = (X'X)^{-1}X'y \tag{4-40}$$

This expresses the least-squares vector b in terms of the sample data incorporated 
in X and y. Equations (4-35) and (4-40) are two equivalent ways of expressing the 
vector of least-squares coefficients b. Two distinct issues arise with respect to these 
equations. First, there is the numerical, or computational, question of how best to 
compute b for given X and y. Second, there is a set of theoretical questions about 
inverse matrices like $(X'X)^{-1}$: how are the elements of the inverse defined, 
and what are the properties of inverse matrices? Our main interest lies with the 
second group of questions, though we will give some illustrations of numerical 
solution methods. We have shown the inverse $(X'X)^{-1}$ to exist when the columns 
of X'X are linearly independent. This concept is intimately related to the concept 
of the rank of a matrix, and it is to this topic that we now turn. 


Rank of a Matrix 


Consider any arbitrary matrix A of order m × n. The columns of A define n 
vectors in $R^m$. Likewise, the rows in A define m vectors in $R^n$. Let r denote the 
maximum number of linearly independent rows in A, so that r ≤ m. When r is 
strictly less than m, there may, of course, be more than one subset of row vectors 
which are linearly independent. For example, suppose we have a matrix with four 
rows (m = 4). It may be that rows 1, 2, and 4 form a linearly independent set and 
that rows 1, 3, and 4 also form a linearly independent set, but that all four rows 
are linearly dependent. In this case r = 3. Returning to the general matrix A, let 
us form a new matrix $\bar{A}$ by taking any set of r linearly independent rows and 
discarding the remaining m - r rows. $\bar{A}$ is then of order r × n. Let c indicate the 
maximum number of linearly independent columns in $\bar{A}$. Then c must also 
indicate the maximum number of linearly independent columns in A. Each 
column in $\bar{A}$ has r elements. Thus we have immediately that 

$$c \le r$$

for any vector in $R^r$ may be expressed as a linear combination of r linearly 
independent vectors. 

Reversing this argument we might form a matrix $\tilde{A}$ of order m × c by 
retaining a subset of c linearly independent columns of A and discarding the 
remaining n - c columns. Since r is defined as the maximum number of linearly 
independent rows in A, it also denotes the maximum number of linearly independent 
rows in $\tilde{A}$. But since each row in $\tilde{A}$ has just c elements, we have 

$$r \le c$$

Thus 

$$r = c$$

that is, for any m × n matrix A the maximum number of linearly independent rows 
is equal to the maximum number of linearly independent columns. This number is 
defined to be the rank of the matrix, and we will denote the rank of A by the 
symbol 

$$\rho(A)$$
Example 4-7 Consider 
Sod2ed Bridh4, 
AKA aOioe ib 
Dip, RE Ne} 


By inspection rows 1 and 2 are linearly independent; also rows 1 and 3 are 
linearly independent, but row 1 + row 2 - row 3 gives the zero vector, so all 
three rows are linearly dependent. Thus r = 2. Let us form a matrix $\bar{A}$ by 
discarding the third row of A. Thus 


[keer 4 
pe | Tea dipd 
Clearly, all pairs of columns of $\bar{A}$ are linearly independent. Thus c is at least 
equal to 2. But it cannot exceed 2, for any column in $\bar{A}$ can be expressed as a 
linear combination of a pair of columns. For example, 

col 3 = col 1 + col 2 
col 4 = col 1 + 1.5 col 2 
col 1 = 3 col 3 - 2 col 4 

and so on. Thus r = c = 2 = ρ(A). Alternatively, if we commence with the 
columns of A, we cannot find a set of three linearly independent columns, for 
the relations stated above for the columns of $\bar{A}$ also hold for the columns of 
the full matrix A, as readers should verify for themselves. 


It is obvious that the rank of a matrix cannot exceed the number of columns 
or the number of rows, whichever is the smaller. That is, 

$$\rho(A) \le \min(m, n) \tag{4-41}$$

When ρ(A) = m, we say that the matrix has full row rank, and when ρ(A) = n, 
that it has full column rank, but, of course, in any specific case, row rank and 
column rank are identical, and we speak unambiguously of the rank of the matrix. 

Notice that it follows directly from the definition of rank that the rank of the 
transpose of A is equal to the rank of A, that is, 

$$\rho(A') = \rho(A) \tag{4-42}$$

for 

ρ(A') = number of linearly independent columns (rows) in A' 
      = number of linearly independent rows (columns) in A 
      = ρ(A) 

In the special case where A is a square matrix of order n and rank n, A is 
said to be nonsingular, and a unique inverse $A^{-1}$ exists, such that 

$$AA^{-1} = A^{-1}A = I$$

When the rank of A is less than n, A is said to be a singular matrix and its inverse 
does not exist. 

Returning now to the general case of an m × n matrix A, let us suppose 
ρ(A) = r. Thus there is at least one set of r linearly independent rows and at least 
one set of r linearly independent columns. If necessary, rows and columns may be 
interchanged so that the first r rows and the first r columns are linearly 
independent. The matrix may then be partitioned by the first r rows and columns: 

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

where the first block row contains r rows and the second m - r rows, and the first 
block column contains r columns and the second n - r columns. 
Thus $A_{11}$ is a square nonsingular matrix of order r. Consider now the set of 
homogeneous equations 

$$Ax = 0 \tag{4-43}$$

where x denotes a column vector of n unknowns. The equations are said to be 
homogeneous because of the 0 vector on the right-hand side of Eq. (4-43). If the 
equations read Ax = b, for b ≠ 0, they are said to be nonhomogeneous. Clearly, 

if $x_1$ is a solution to Eq. (4-43), then so is $cx_1$ for any scalar c, 

and 

if $x_1$ and $x_2$ are two distinct solutions to Eq. (4-43), then $c_1x_1 + c_2x_2$ is also a 
solution. 


Thus the set of solutions to Eq. (4-43) constitutes a vector space called the 
nullspace of A. Our immediate concern is to establish the dimension of this 
nullspace (that is, the number of linearly independent vectors which span the 
subspace). Let us drop the last m - r rows from A and partition x conformably 
with the columns of A. This gives 

$$\begin{bmatrix} A_{11} & A_{12} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0 \tag{4-44}$$

where $x_1$ contains r elements and $x_2$ the remaining n - r elements. This gives a 
set of r linearly independent equations in n ≥ r unknowns. Rewriting as 

$$A_{11}x_1 + A_{12}x_2 = 0$$

and premultiplying by $A_{11}^{-1}$, which exists since $A_{11}$ is nonsingular, gives 

$$x_1 = -A_{11}^{-1}A_{12}x_2 \tag{4-45}$$

The $x_2$ subvector is arbitrary or "free" in the sense that we can specify the n - r 
elements in $x_2$ at will, but for any such specification the subvector $x_1$ is 
determined by Eq. (4-45). Using Eq. (4-45), the general solution vector to Eq. 
(4-44) may be written 

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -A_{11}^{-1}A_{12} \\ I_{n-r} \end{bmatrix} x_2 \tag{4-46}$$

The matrix in Eq. (4-46) has n rows and n - r columns. The n - r columns are 
linearly independent. This fact is guaranteed by the presence of the $I_{n-r}$ submatrix, 
whose columns are necessarily linearly independent. Thus Eq. (4-46) 
expresses all solutions to Eq. (4-44) as linear combinations of n - r linearly 
independent n-element vectors. But any solution to Eq. (4-44) is also a solution to 
Eq. (4-43), for the rows that have been discarded from A to arrive at Eq. (4-44) 
are linear combinations of the rows of $[A_{11} \;\; A_{12}]$. Any discarded row may thus 
be expressed in the form 

$$c'\begin{bmatrix} A_{11} & A_{12} \end{bmatrix}$$

where c' is some appropriate row vector of r elements. Postmultiplying by x gives 

$$c'\begin{bmatrix} A_{11} & A_{12} \end{bmatrix}x = 0$$

since x satisfies Eq. (4-44). Thus each solution x holds for the discarded rows, and 
Eq. (4-46) defines the solution vector for Eq. (4-43). Thus the nullspace of A has 
dimension n - r. This gives the important result that for an m × n matrix A with 
rank r 

Number of columns = rank + dimension of nullspace (4-47) 

n = r + (n - r) 

The nullspace is sometimes referred to as the kernel of A and its dimension as the 
nullity. Thus the result may also be stated as 

Number of columns = rank + nullity 


Example 4-8 Consider 

$$Ax = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 1 & 1 \\ 2 & 4 & 4 & 5 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \tag{4-48}$$

The rank of the matrix is seen to be 2 since rows 1 and 2 are clearly linearly 
independent, as are rows 1 and 3, but all three rows are not linearly 
independent, since 

row 1 + row 2 - row 3 = 0 

Discarding the third row gives the set 

$$\begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{4-49}$$

Columns 1 and 3 are linearly independent, so we rewrite Eq. (4-49) as 

$$\begin{bmatrix} 1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_3 \end{bmatrix} = -\begin{bmatrix} 2 & 4 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} x_2 \\ x_4 \end{bmatrix}$$

Solving this pair of equations for $x_1$ and $x_3$ gives 

$$x_1 = -2x_2 + \tfrac{1}{2}x_4$$
$$x_3 = -\tfrac{3}{2}x_4$$

and the solution vector to Eq. (4-49) may be expressed as 

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} -2 & \tfrac{1}{2} \\ 1 & 0 \\ 0 & -\tfrac{3}{2} \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x_2 \\ x_4 \end{bmatrix} \tag{4-50}$$

The matrix in Eq. (4-50) has two linearly independent columns, and so any 
solution to Eq. (4-49) may be expressed as a linear combination of two 
linearly independent four-element vectors. The solution vectors thus form a 
two-dimensional subspace in $R^4$. There are infinitely many solution vectors 
since the vector $[x_2 \;\; x_4]'$ on the right-hand side of Eq. (4-50) is arbitrary. Any 
solution defined by Eq. (4-50) is also a solution to the initial set of equations 
(4-48). This may be seen by noticing that each column vector in Eq. (4-50) 
satisfies the equation discarded from (4-48), that is, 

$$\begin{bmatrix} 2 & 4 & 4 & 5 \end{bmatrix}\begin{bmatrix} -2 \\ 1 \\ 0 \\ 0 \end{bmatrix} = 0$$

and 

$$\begin{bmatrix} 2 & 4 & 4 & 5 \end{bmatrix}\begin{bmatrix} \tfrac{1}{2} \\ 0 \\ -\tfrac{3}{2} \\ 1 \end{bmatrix} = 0$$

Since any solution vector x is a linear combination of these two column 
vectors, x satisfies the third equation in Eqs. (4-48). Since it already 
satisfies the first two equations, it is a solution to Eq. (4-48). The nullspace of 
A thus has dimension 2, which is equal to the number of columns in A minus 
the rank of A. Each vector in the nullspace is orthogonal to each row in A. 

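There is a seeming element of arbitrariness in the partitioning that we 
applied to Eq. (4-49) and also in the choice of the row of A to be discarded. 
But this is apparent, not real. For example, suppose we partition Eq. (4-49) as 

$$\begin{bmatrix} 3 & 4 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_3 \\ x_4 \end{bmatrix} = -\begin{bmatrix} 1 & 2 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

which gives 

$$x_3 = -3x_1 - 6x_2$$
$$x_4 = 2x_1 + 4x_2$$

with solution vector 

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -3 & -6 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{4-51}$$

Equation (4-51) again defines the nullspace of the matrix A in Eq. (4-48). It 
has dimension 2, and the columns in the matrix of Eq. (4-51) are linearly 
independent. This, however, is the same nullspace as defined by Eq. (4-50), 
for each column vector in Eq. (4-51) may be expressed as a linear combination 
of the column vectors in Eq. (4-50): 

$$\begin{bmatrix} 1 \\ 0 \\ -3 \\ 2 \end{bmatrix} = \lambda_1\begin{bmatrix} -2 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \lambda_2\begin{bmatrix} \tfrac{1}{2} \\ 0 \\ -\tfrac{3}{2} \\ 1 \end{bmatrix} \quad \text{with } \lambda_1 = 0,\ \lambda_2 = 2$$

and 

$$\begin{bmatrix} 0 \\ 1 \\ -6 \\ 4 \end{bmatrix} = \lambda_1\begin{bmatrix} -2 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \lambda_2\begin{bmatrix} \tfrac{1}{2} \\ 0 \\ -\tfrac{3}{2} \\ 1 \end{bmatrix} \quad \text{with } \lambda_1 = 1,\ \lambda_2 = 4$$

Thus the nullspaces are the same. Discarding the first or the second equation 
from A in the initial stage would also make no difference to the determination 
of the nullspace. 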
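The calculation in this example can be mechanized by row reduction. The sketch below is an illustration only: the coefficient matrix is the one implied by the example (rows 1 2 3 4, 1 2 1 1, and their sum 2 4 4 5), and `rref` and `nullspace_basis` are helper names introduced here. It reduces A to reduced row echelon form, reads off the rank from the number of pivot columns, and builds one nullspace basis vector per free column.

```python
from fractions import Fraction


def rref(m):
    """Return (reduced row echelon form of m, list of pivot column indices)."""
    a = [[Fraction(v) for v in row] for row in m]
    pivots, row = [], 0
    for col in range(len(a[0])):
        pivot = next((r for r in range(row, len(a)) if a[r][col] != 0), None)
        if pivot is None:
            continue                      # no pivot here: this column is free
        a[row], a[pivot] = a[pivot], a[row]
        a[row] = [v / a[row][col] for v in a[row]]
        for r in range(len(a)):
            if r != row and a[r][col] != 0:
                f = a[r][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[row])]
        pivots.append(col)
        row += 1
        if row == len(a):
            break
    return a, pivots


def nullspace_basis(m):
    """One basis vector per free column: set that unknown to 1, other free ones to 0."""
    reduced, pivots = rref(m)
    n = len(m[0])
    basis = []
    for free in (c for c in range(n) if c not in pivots):
        v = [Fraction(0)] * n
        v[free] = Fraction(1)
        for i, p in enumerate(pivots):
            v[p] = -reduced[i][free]      # pivot unknowns follow from the free one
        basis.append(v)
    return basis


A = [[1, 2, 3, 4], [1, 2, 1, 1], [2, 4, 4, 5]]
_, pivots = rref(A)
rank = len(pivots)
basis = nullspace_basis(A)
print(rank, basis)
print(len(A[0]) == rank + len(basis))     # number of columns = rank + nullity
```

With columns 1 and 3 as pivot columns, the two basis vectors produced are the columns of the matrix in Eq. (4-50).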


We may note here a particular application of Eq. (4-47), which will be very 
useful in the treatment of identification in Chap. 11. If A is m × n and has rank 
n - 1, then the dimension of the nullspace of A is 1, that is, all solutions to 

$$Ax = 0$$

lie on a single ray through the origin. Thus if 

$$x' = [x_1 \;\; x_2 \;\; \cdots \;\; x_n]$$

is a solution, then so is 

$$cx' = [cx_1 \;\; cx_2 \;\; \cdots \;\; cx_n]$$

for any constant c. 

Result (4-47) also yields simple proofs of some important theorems on the 
ranks of various matrices. We notice that the crucial matrix for the least-squares 
vector in Eqs. (4-35) and (4-40) is X'X. The first important theorem states that 

$$\rho(X'X) = \rho(XX') = \rho(X) \tag{4-52}$$

Let X be n × k with ρ(X) = r. Then by Eq. (4-47) the nullspace of X has 
dimension k - r. If m denotes any vector in this nullspace, 

$$Xm = 0$$

Premultiplying by X' gives 

$$X'Xm = 0$$

Thus m also lies in the nullspace of X'X. Conversely, let s be any vector in the 
nullspace of X'X. Then 

$$X'Xs = 0$$

Premultiplying by s' gives 

$$s'X'Xs = (Xs)'(Xs) = 0$$

Thus Xs is a vector with zero length and so must be the null vector, that is, 

$$Xs = 0$$

Thus s lies in the nullspace of X. 

We have shown that X and X'X have the same nullspace and hence the same 
nullity. Each matrix has k columns. Thus by Eq. (4-47) each matrix has rank r 
since 

Rank = number of columns - nullity 

For the least-squares case X is n × k with k < n. Provided there are no exact 
linear relations between the explanatory variables, X has full column rank, 
and so 

$$\rho(X'X) = k$$

Since X'X is a square matrix of order k, it is then nonsingular and the inverse 
$(X'X)^{-1}$ exists. 
To prove the rest of theorem (4-52) we merely note that ρ(X) = ρ(X'), and 
the above proof immediately gives 

$$\rho(XX') = \rho(X') = \rho(X)$$

Notice that XX' is a square matrix of order n (> k) so that even if X has full 
column rank, XX' is still singular. 
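A small numerical check may make the contrast between X'X and XX' concrete. This is a sketch under stated assumptions: the 4 × 2 matrix X is hypothetical, and `rank` is a helper written here that counts pivots in forward elimination. All three ranks equal 2, so the 2 × 2 matrix X'X is nonsingular while the 4 × 4 matrix XX' is singular.

```python
from fractions import Fraction


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(m):
    """Rank as the number of pivots found by forward elimination."""
    a = [[Fraction(v) for v in row] for row in m]
    r = 0
    for col in range(len(a[0])):
        pivot = next((i for i in range(r, len(a)) if a[i][col] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, len(a)):
            f = a[i][col] / a[r][col]
            a[i] = [v - f * p for v, p in zip(a[i], a[r])]
        r += 1
        if r == len(a):
            break
    return r


# Hypothetical data matrix with n = 4 observations and k = 2 regressors.
X = [[1, 1], [1, 2], [1, 3], [1, 4]]
XtX = matmul(transpose(X), X)      # order k = 2
XXt = matmul(X, transpose(X))      # order n = 4

print(rank(X), rank(XtX), rank(XXt))
print(rank(XXt) < len(XXt))        # True: XX' has rank below its order
```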

Another important theorem on rank may be stated as follows. If A is any 
m × n matrix with rank r, and P and Q are square nonsingular matrices of order 
m and n, respectively, then 

$$\rho(PA) = \rho(AQ) = \rho(PAQ) = \rho(A) \tag{4-53}$$

that is, pre- or postmultiplication of A by a nonsingular matrix does not change its 
rank. 

To prove ρ(PA) = ρ(A), let m be any vector in the nullspace of A. Then 

$$Am = 0$$

Thus 

$$PAm = 0$$

and m also lies in the nullspace of PA. Conversely, let s be any vector in the 
nullspace of PA. Then 

$$PAs = 0$$

Since P is nonsingular, we may premultiply this equation by $P^{-1}$ to obtain 

$$As = 0$$

Thus s also lies in the nullspace of A, and PA and A have the same nullity and the 
same number of columns. Hence the ranks are the same. 

To prove ρ(AQ) = ρ(A) we note that 

ρ(AQ) = ρ(Q'A') 
      = ρ(A')     by the above proof 
      = ρ(A) 

and finally 

$$\rho(PAQ) = \rho(A)$$

follows directly from the previous results. 

Both previous theorems involve special cases of the multiplication of one 
matrix by another. In Eq. (4-52) a matrix was multiplied by its transpose. In Eq. 
(4-53) multiplication was by a nonsingular matrix. Our final theorem on rank 
relates to the perfectly general case of the multiplication of one rectangular matrix 
by another conformable rectangular matrix. Let A be m × n and let B be n × s. 
Then 

$$\rho(AB) \le \min[\rho(A), \rho(B)] \tag{4-54}$$

that is, the rank of the product AB is less than or equal to the smaller of the ranks of 
the constituent matrices. 

If x denotes any vector in the nullspace of B, then 

$$Bx = 0$$

and so 

$$ABx = 0$$

Thus x also lies in the nullspace of AB. But this time we cannot go in the opposite 
direction and prove that 

ABy = 0 implies By = 0 

Thus all we can say is that the nullspace of B is contained in (or is a subspace of) 
the nullspace of AB. Therefore, we have 

Dimension of nullspace of B ≤ dimension of nullspace of AB 

Since B and AB have the same number of columns, it then follows from Eq. 
(4-47) that 

$$\rho(AB) \le \rho(B)$$

By the usual trick with transposes, 

$$\rho(AB) = \rho(B'A') \le \rho(A') = \rho(A)$$

and so Eq. (4-54) is proved. 
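A small illustration shows that the inequality in Eq. (4-54) can be strict and that ABy = 0 need not imply By = 0. The matrices below are chosen here purely for illustration and do not come from the text.

```latex
\[
A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
AB = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}
\]
\[
\rho(A) = \rho(B) = 1, \qquad \rho(AB) = 0 < \min[\rho(A), \rho(B)]
\]
\[
y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}: \qquad
By = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \neq 0
\quad \text{but} \quad ABy = 0
\]
```

Here the nullspace of B has dimension 1 while the nullspace of AB is the whole of $R^2$, so the nullspace of B is a proper subspace of the nullspace of AB.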


Summary on the Rank of an m X n Matrix A 


1. The maximum number of linearly independent rows is equal to the maximum 
number of linearly independent columns. This number is the rank of the 
matrix, denoted by ρ(A). 

2. ρ(A) ≤ min(m, n). 

3. ρ(A) = ρ(A'). 

4. If ρ(A) = m = n, then A is nonsingular and a unique inverse $A^{-1}$ exists. 

5. n = ρ(A) + nullity of A, where the nullity of A is the dimension of the 
subspace containing all vectors x which are solutions to Ax = 0. 

6. ρ(X'X) = ρ(XX') = ρ(X). 

7. If P and Q are nonsingular matrices of orders m and n, respectively, then 
ρ(PA) = ρ(AQ) = ρ(PAQ) = ρ(A). 

8. ρ(AB) ≤ min[ρ(A), ρ(B)]. 


The Inverse Matrix 


It is now time to return to the topic of matrix inversion and develop some of the 
properties of inverse matrices. We will also see how to compute inverse matrices, 
though this is a tedious and inefficient procedure for the numerical solution of 
equations. The procedure does, however, shed light on the theoretical properties 
of the inverse. 

Let A denote a square matrix of order n. The condition for the inverse to exist 
may be stated in several equivalent ways: 


1. A is nonsingular. 
2. A has rank n. 
3. The n rows of A are linearly independent. 
4. The n columns of A are linearly independent. 


To study the form of the inverse matrix let us begin with the 2 × 2 case. 
Denote A and $A^{-1}$ as follows: 

$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \qquad A^{-1} = \begin{bmatrix} a^{11} & a^{12} \\ a^{21} & a^{22} \end{bmatrix}$$

So far we have regarded matrices mostly as collections of vectors and paid little 
attention to the individual elements. The standard notation is to use the first 
subscript of an element to indicate the row in which that element appears and the 
second subscript to indicate the column. The definition of the inverse gives the 
general equation 

$$AA^{-1} = I \tag{4-55}$$

Specializing this equation to the 2 × 2 case and taking just the first column from 
each side of the equation gives 

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}\begin{bmatrix} a^{11} \\ a^{21} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

Treating the elements of the inverse as unknowns, the solution of this pair of 
equations gives 

$$a^{11} = \frac{a_{22}}{a_{11}a_{22} - a_{12}a_{21}} \qquad a^{21} = \frac{-a_{21}}{a_{11}a_{22} - a_{12}a_{21}}$$

Similarly, equating the second columns in Eq. (4-55) and solving gives 

$$a^{12} = \frac{-a_{12}}{a_{11}a_{22} - a_{12}a_{21}} \qquad a^{22} = \frac{a_{11}}{a_{11}a_{22} - a_{12}a_{21}}$$

Thus the inverse has been derived as 

$$A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}}\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \tag{4-56}$$

and it may readily be checked that indeed $AA^{-1} = A^{-1}A = I$. Each element in 
$A^{-1}$ is a function of the elements in A, and even for the 2 × 2 case certain 
important features of $A^{-1}$ are apparent. First, each element in the inverse has a 
common divisor, namely, $a_{11}a_{22} - a_{12}a_{21}$. This is a function of all the elements in 
A. It is a scalar quantity and is defined as the determinant of A. For the 2 × 2 case 
we thus have 

$$\det A = |A| = a_{11}a_{22} - a_{12}a_{21} = \sum_{\alpha,\beta} \pm a_{1\alpha}a_{2\beta} \tag{4-57}$$
The two expressions on the left of Eq. (4-57) are alternative ways of indicating the 
determinant. The final expression on the right means 

$$\sum_{\alpha,\beta} \pm a_{1\alpha}a_{2\beta} =$$

sum of all possible products of the elements of A, taken two at a time, 
with the first subscript in natural order 1, 2 and α, β indicating all 
possible permutations of 1, 2 for the second subscript, each product 
term being affixed with a positive (negative) sign as the number of 
inversions of the natural order in the second subscript is even (odd). 

There are only two possible permutations of 1, 2, namely, 1, 2 itself and 2, 1. 
There is one inversion of the natural order in 2, 1 since 2 comes before 1. Thus the 
terms in the expansion are simply 

$$a_{11}a_{22} - a_{12}a_{21}$$
The numerators of the elements in $A^{-1}$ could have been produced by the 
following two rules: 

1. For each element in A, strike out the row and column containing that element 
and write down the remaining element prefixed with a positive or negative 
sign in the pattern 

$$\begin{bmatrix} + & - \\ - & + \end{bmatrix}$$

This gives the matrix 

$$\begin{bmatrix} a_{22} & -a_{21} \\ -a_{12} & a_{11} \end{bmatrix}$$

2. Transpose the matrix obtained in rule 1 to get 

$$\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}$$

Let us try to apply these rules to the 3 × 3 case. Now we have 

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

By extension of Eq. (4-57) we define the determinant of A as 

$$\det A = |A| = \sum_{\alpha,\beta,\gamma} \pm a_{1\alpha}a_{2\beta}a_{3\gamma} \tag{4-58}$$

There will be 3! = 6 terms in the expansion, since that is the number of possible 
permutations of 1, 2, 3. Half will have a positive sign and half a negative sign. The 
explicit expression is 

$$|A| = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} - a_{13}a_{22}a_{31} \tag{4-59}$$

As a check on the signs we may notice, for instance, that in the third term in Eq. 
(4-59) the order of the second subscripts is 

3, 1, 2 

which contains two inversions since 3 comes in front of both 1 and 2. The final 
term 

3, 2, 1 

yields three inversions (3 before 2 and 1, and 2 before 1). 

Expressions (4-58) and (4-59) correctly define the determinant of a third-order 
matrix. The expression is already so cumbersome that the generalization to the 
nth-order case would be unpleasant. However, we shall derive below a more 
tractable expression for the determinant. 
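The permutation definition translates almost word for word into a short program. The sketch below is an illustration, not an efficient method: the 3 × 3 matrix is hypothetical, and `inversions` and `det_by_permutations` are names introduced here. Each term takes one element from every row, the column indices run over all permutations, and the sign is fixed by counting inversions of the natural order.

```python
from itertools import permutations


def inversions(perm):
    """Count pairs that appear out of natural order in perm."""
    return sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )


def det_by_permutations(a):
    """Sum of signed products a[0][p0] * a[1][p1] * ... over all permutations p."""
    n = len(a)
    total = 0
    for perm in permutations(range(n)):
        term = 1
        for row, col in enumerate(perm):
            term *= a[row][col]
        # Even number of inversions: positive sign; odd: negative sign.
        total += term if inversions(perm) % 2 == 0 else -term
    return total


# Second subscripts 3,1,2 and 3,2,1 of the text, written with zero-based indices.
print(inversions((2, 0, 1)), inversions((2, 1, 0)))

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]   # hypothetical matrix
print(det_by_permutations(A))
```

The number of terms grows as n!, which is the practical reason for preferring the cofactor expansion developed next.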

The numerator rule for the second-order case, however, does not extend to 
the third-order case without modifications. If we strike out the row and column 
containing, say, $a_{11}$, we are now left with the 2 × 2 submatrix 

$$\begin{bmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{bmatrix}$$

rather than a scalar element. We, in fact, replace $a_{11}$ with the determinant of this 
submatrix, appropriately signed, and similarly for the other elements. The general 
rules for determining the elements of $A^{-1}$ in the 3 × 3 case may now be stated. 

Let $M_{ij}$ be the determinant of the 2 × 2 submatrix obtained when row i and 
column j are deleted from A. $M_{ij}$ is termed a minor. Further define 

$$C_{ij} = (-1)^{i+j}M_{ij}$$

$C_{ij}$ denotes a cofactor and is simply a signed minor. Thus the sign of $M_{ij}$ does not 
change if i + j is an even number and does change if that sum is odd. 
The rules then become as follows: 

1. Form a matrix in which each element ($a_{ij}$) is replaced by the corresponding 
cofactor ($C_{ij}$). 
2. Transpose this matrix. The result is sometimes referred to as the adjugate or 
adjoint matrix. 
3. Divide each element in rule 2 by |A|. The result is $A^{-1}$. 

For the third-order case 

$$C_{11} = \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} = a_{22}a_{33} - a_{23}a_{32}$$

$$C_{23} = -\begin{vmatrix} a_{11} & a_{12} \\ a_{31} & a_{32} \end{vmatrix} = -(a_{11}a_{32} - a_{12}a_{31})$$

and so on, and 

$$A^{-1} = \frac{1}{|A|}\begin{bmatrix} C_{11} & C_{21} & C_{31} \\ C_{12} & C_{22} & C_{32} \\ C_{13} & C_{23} & C_{33} \end{bmatrix} \tag{4-60}$$
Finally, we may note an alternative expression for the determinant of A. Returning 
to Eq. (4-59) and collecting terms in the elements of the first row gives 

$$|A| = a_{11}(a_{22}a_{33} - a_{23}a_{32}) + a_{12}(-a_{21}a_{33} + a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$$

Using the definitions of cofactors just given, we then have 

$$|A| = a_{11}C_{11} + a_{12}C_{12} + a_{13}C_{13} \tag{4-61}$$

This defines |A| as a linear combination of the elements in the first row, each 
element being multiplied by its cofactor. This expansion is clearly not unique. |A| 
may be expressed in terms of the elements of any row (or column), provided that 
in each case the elements are multiplied by the corresponding cofactors. Readers 
should satisfy themselves by direct substitution that any other similar expansion 
gives the same result as Eq. (4-59). 

These rules for the 3 × 3 case have been rather plucked out of the air. Let us 
check that they work for a numerical example before continuing to the nth-order 
case. 


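Example 4-9 

$$A = \begin{bmatrix} 1 & 3 & 4 \\ 1 & 2 & 1 \\ 2 & 4 & 5 \end{bmatrix}$$

Replacing each element by its minor gives the matrix 

$$\begin{bmatrix}
\begin{vmatrix} 2 & 1 \\ 4 & 5 \end{vmatrix} & \begin{vmatrix} 1 & 1 \\ 2 & 5 \end{vmatrix} & \begin{vmatrix} 1 & 2 \\ 2 & 4 \end{vmatrix} \\[6pt]
\begin{vmatrix} 3 & 4 \\ 4 & 5 \end{vmatrix} & \begin{vmatrix} 1 & 4 \\ 2 & 5 \end{vmatrix} & \begin{vmatrix} 1 & 3 \\ 2 & 4 \end{vmatrix} \\[6pt]
\begin{vmatrix} 3 & 4 \\ 2 & 1 \end{vmatrix} & \begin{vmatrix} 1 & 4 \\ 1 & 1 \end{vmatrix} & \begin{vmatrix} 1 & 3 \\ 1 & 2 \end{vmatrix}
\end{bmatrix} = \begin{bmatrix} 6 & 3 & 0 \\ -1 & -3 & -2 \\ -5 & -3 & -1 \end{bmatrix}$$

Signing the minors gives the matrix of cofactors as 

$$\begin{bmatrix} 6 & -3 & 0 \\ 1 & -3 & 2 \\ -5 & 3 & -1 \end{bmatrix}$$

Transposing gives the adjugate matrix 

$$\begin{bmatrix} 6 & 1 & -5 \\ -3 & -3 & 3 \\ 0 & 2 & -1 \end{bmatrix}$$

Expressing the determinant of A in terms of the elements in the first row gives 

$$|A| = a_{11}C_{11} + a_{12}C_{12} + a_{13}C_{13} = 1(6) + 3(-3) + 4(0) = -3$$

Thus the inverse matrix is 

$$A^{-1} = \begin{bmatrix} -2 & -\tfrac{1}{3} & \tfrac{5}{3} \\ 1 & 1 & -1 \\ 0 & -\tfrac{2}{3} & \tfrac{1}{3} \end{bmatrix}$$

It is easily checked that $AA^{-1} = A^{-1}A = I$. 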
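The three rules can be followed step by step in code. The sketch below is an illustration under stated assumptions: the matrix A is taken to be the one with first row 1, 3, 4 whose first-row cofactors are 6, -3, 0 and whose determinant is -3, and the helper names `minor_matrix`, `det`, `cofactor`, and `inverse` are introduced here. It forms the cofactors, transposes them into the adjugate, divides by |A|, and checks both products against the identity.

```python
from fractions import Fraction


def minor_matrix(a, i, j):
    """Submatrix left after deleting row i and column j."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor_matrix(a, 0, j)) for j in range(len(a)))


def cofactor(a, i, j):
    """Signed minor C_ij = (-1)^(i+j) M_ij."""
    return (-1) ** (i + j) * det(minor_matrix(a, i, j))


def inverse(a):
    """Adjugate divided by the determinant; entry (i, j) is C_ji / |A|."""
    n = len(a)
    d = det(a)
    return [[Fraction(cofactor(a, j, i), d) for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1, 3, 4], [1, 2, 1], [2, 4, 5]]
cofactors = [[cofactor(A, i, j) for j in range(3)] for i in range(3)]
A_inv = inverse(A)
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(cofactors, det(A))
print(A_inv)
print(matmul(A, A_inv) == identity, matmul(A_inv, A) == identity)
```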


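For the nth-order case the rules for obtaining $A^{-1}$ are essentially those 
already stated for the third-order case. The determinant is defined as 

$$|A| = \sum \pm a_{1\alpha}a_{2\beta} \cdots a_{n\nu} \tag{4-62}$$

or alternatively 

$$|A| = a_{i1}C_{i1} + a_{i2}C_{i2} + \cdots + a_{in}C_{in} \qquad \text{for any } i = 1, \ldots, n \tag{4-63}$$

or 

$$|A| = a_{1j}C_{1j} + a_{2j}C_{2j} + \cdots + a_{nj}C_{nj} \qquad \text{for any } j = 1, \ldots, n$$

The cofactors are now the signed minors of matrices of order n - 1, and the 
inverse matrix is 

$$A^{-1} = \frac{1}{|A|}\begin{bmatrix} C_{11} & C_{21} & \cdots & C_{n1} \\ C_{12} & C_{22} & \cdots & C_{n2} \\ \vdots & \vdots & & \vdots \\ C_{1n} & C_{2n} & \cdots & C_{nn} \end{bmatrix} \tag{4-64}$$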


Properties of Determinants 

The following properties are stated for the determinants of nth-order matrices, 
but they will often be illustrated for the 2 × 2 case. To economize on space, 
proofs will not always be given. 

1. |A'| = |A| even if A is not symmetric. 

$$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc = \begin{vmatrix} a & c \\ b & d \end{vmatrix}$$

2. If B is obtained from A by interchanging any two rows (or columns) of A, 
|B| = -|A|. 

$$\begin{vmatrix} c & d \\ a & b \end{vmatrix} = cb - da = -\begin{vmatrix} a & b \\ c & d \end{vmatrix}$$

Suppose in an n × n matrix we interchange the first two rows. Let $a_{ij}$ denote 
the i, jth element in the original matrix and $b_{ij}$ the i, jth element in the new 
matrix. A term such as 

$$a_{1\alpha}a_{2\beta}a_{3\gamma} \cdots a_{n\nu}$$

in the expansion of |A| thus becomes 

$$b_{2\alpha}b_{1\beta}b_{3\gamma} \cdots b_{n\nu}$$

in the expansion of |B|, where $b_{1j} = a_{2j}$, $b_{2j} = a_{1j}$, and $b_{ij} = a_{ij}$ for i = 3, ..., n. 
The numerical values of these two terms are identical; the crucial question is the 
sign. To determine the sign of the second term, the first subscripts must be put in 
natural order and the number of inversions in the second subscript determined. 
This gives 

$$b_{1\beta}b_{2\alpha}b_{3\gamma} \cdots b_{n\nu}$$

and, compared with the corresponding term in |A|, one inversion has been 
introduced or removed, so that this term (and each and every term) changes sign 
in |B| as compared with |A|. If we interchange rows i and j, which are separated 
by, say, r rows, reordering the b elements in any term to put the first subscripts in 
natural order will involve 2r + 1 changes, where each change introduces a new 
inversion in the second subscript or removes an existing inversion. This is 
illustrated below, where only the first subscripts on the b's are shown. 

(Illustration: the first subscripts j and i, separated by r elements.) 

Since 2r + 1 is an odd number, the sign of the term changes, and so |B| = -|A|. 
Property 1 then ensures that interchanging any two columns will also change the 
sign of the determinant. 


3. If a matrix has two or more identical rows (or columns), its determinant is zero. 

$$\begin{vmatrix} a & b \\ a & b \end{vmatrix} = ab - ab = 0$$

From property 2, interchanging identical rows would change the sign of the 
determinant. But the new matrix is identical with the old, and so its determinant 
is unchanged. This gives 

$$|A| = -|A|$$

so that 

$$|A| = 0$$

4. Expansions in terms of alien cofactors vanish. By this is meant an expression 
such as 

$$a_{j1}C_{i1} + a_{j2}C_{i2} + \cdots + a_{jn}C_{in} \qquad (i \ne j)$$

where the elements of row j are multiplied by the cofactors of the elements of 
row i. 

This is exactly the expression we would obtain for the determinant of a 
matrix whose rows i and j are identical. By property 3, that determinant is zero. 

5. If B is formed from A by adding a multiple of one row (or column) to another 
row (or column), the value of the determinant is unchanged. 

$$\begin{vmatrix} a + \lambda c & b + \lambda d \\ c & d \end{vmatrix} = (a + \lambda c)d - (b + \lambda d)c = ad - bc + \lambda(cd - cd) = \begin{vmatrix} a & b \\ c & d \end{vmatrix}$$

Suppose row i of B = row i of A + λ · row j of A. Expanding |B| in terms of 
its ith row gives 

$$|B| = (a_{i1} + \lambda a_{j1})C_{i1} + (a_{i2} + \lambda a_{j2})C_{i2} + \cdots + (a_{in} + \lambda a_{jn})C_{in}$$
$$= (a_{i1}C_{i1} + a_{i2}C_{i2} + \cdots + a_{in}C_{in}) + \lambda(a_{j1}C_{i1} + a_{j2}C_{i2} + \cdots + a_{jn}C_{in})$$
$$= |A|$$

since the coefficient of λ is an expansion in terms of alien cofactors, which 
vanishes. 
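Properties 1 to 5 can be spot-checked on a single matrix. The sketch below is only an illustration: the 3 × 3 matrix is hypothetical (chosen to be nonsymmetric), and `det` and `cofactor` are helper names introduced here, using expansion along the first row. A numerical check on one matrix is of course not a proof.

```python
def minor_matrix(a, i, j):
    """Submatrix left after deleting row i and column j."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor_matrix(a, 0, j)) for j in range(len(a)))


def cofactor(a, i, j):
    return (-1) ** (i + j) * det(minor_matrix(a, i, j))


A = [[2, 1, 0], [0, 3, 1], [5, 1, 4]]          # hypothetical nonsymmetric matrix

transposed = [list(col) for col in zip(*A)]     # property 1
swapped = [A[2], A[1], A[0]]                    # property 2: rows 1 and 3 interchanged
repeated = [A[0], A[0], A[2]]                   # property 3: two identical rows
# Property 4: elements of row 2 times cofactors of row 1 (alien cofactors).
alien = sum(A[1][k] * cofactor(A, 0, k) for k in range(3))
# Property 5: add 3 times row 3 to row 1.
combined = [[x + 3 * z for x, z in zip(A[0], A[2])], A[1], A[2]]

print(det(A), det(transposed), det(swapped), det(repeated), alien, det(combined))
```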


6. If the rows (or columns) of A are linearly dependent, |A| = 0, and if they are 
linearly independent, |A| ≠ 0. If the rows of 

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

are linearly dependent, there exist nonzero scalars $\lambda_1$, $\lambda_2$ such that 

$$\lambda_1 a + \lambda_2 c = 0$$
$$\lambda_1 b + \lambda_2 d = 0$$

Thus 

$$c = -\frac{\lambda_1}{\lambda_2}a \qquad \text{and} \qquad d = -\frac{\lambda_1}{\lambda_2}b$$

and A may be written 

$$A = \begin{bmatrix} a & b \\ \lambda a & \lambda b \end{bmatrix}$$

where $\lambda = -\lambda_1/\lambda_2$, and so |A| = λ(ab - ab) = 0. 

In the general case if row i is a linear combination of certain other rows, 
subtracting that linear combination from row i will produce a zero row. Subtracting 
the linear combination is merely a repeated application of property 5 and so 
leaves the determinant unchanged, but that determinant is zero since the process 
has ended with a matrix containing a zero row. 

If the rows (columns) of A are linearly independent, there is no way to 
produce a zero row (column) and |A| ≠ 0. Thus nonsingular matrices have 
nonzero determinants. If this were not so, the inverse matrix defined in Eq. (4-64) 
would not exist, since each element is divided by |A|. Conversely, singular 
matrices have zero determinants. 

This result also provides a means of checking on the rank of low-order 
matrices. If A is m × n and has rank r, then there must be at least one square 
submatrix of order r, which is nonsingular and thus has a nonzero determinant, 
and all square submatrices of order r + 1, r + 2, ..., have zero determinants. Thus 
we have the following alternative definition of rank: 

Rank of m × n matrix = order of largest nonvanishing determinant 
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Example 4-10 
1 2 3 4 
A=]1 2 1 1 
2 44 =16..\ — 10 
The rank must be at least 2, since although 


) Lie 


jie” 


there are plenty of nonvanishing second-order determinants that can be 
formed from the elements of A. For example, 


1 3 
PEEING 


1 
0 


and so on. Notice that these are the determinants of second-order sub- 
matrices obtained by deleting any one row and any two columns from A. 
There are four possible third-order determinants to evaluate. Deleting the 
fourth column, 


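$$\begin{vmatrix} 1 & 2 & 3 \\ 1 & 2 & 1 \\ 2 & 4 & -6 \end{vmatrix} = \begin{vmatrix} 0 & 0 & 2 \\ 1 & 2 & 1 \\ 2 & 4 & -6 \end{vmatrix} = 2 \begin{vmatrix} 1 & 2 \\ 2 & 4 \end{vmatrix} = 0$$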

The first step in this evaluation has been to subtract row 2 from row 1, which
by property 5 does not alter the value of the determinant. This gives an
expansion, using Eq. (4-63), in terms of the first row, which now contains just
a single term. To evaluate

$$\begin{vmatrix} 1 & 3 & 4 \\ 1 & 1 & 1 \\ 2 & -6 & -10 \end{vmatrix}$$

we might subtract row 1 from row 2 and we also subtract twice row 1 from
row 3 to get

$$\begin{vmatrix} 1 & 3 & 4 \\ 0 & -2 & -3 \\ 0 & -12 & -18 \end{vmatrix} = \begin{vmatrix} -2 & -3 \\ -12 & -18 \end{vmatrix} = 0$$


The two other third-order determinants may similarly be seen to be zero, so
that $\rho(A) = 2$. Alternatively, we might have spotted that

$$\text{row 3} = 6 \cdot \text{row 2} - 4 \cdot \text{row 1}$$


which establishes that the rank cannot be 3, without a need of evaluating 
third-order determinants. This matrix also illustrates another important point. 
Looking at the square submatrix formed from the first three columns of A, we 
have already shown that its determinant is zero, and inspection of the 
second-order determinants within it shows that its rank is 2. Its three rows are 
connected by the relationship stated above, but the same relationship does 
not hold between the three columns. The linear dependence between the three
columns is expressed by

$$2 \cdot \text{column 1} - 1 \cdot \text{column 2} = 0$$

or

$$2 \cdot \text{column 1} - 1 \cdot \text{column 2} + 0 \cdot \text{column 3} = 0$$


The important point is that a set of vectors is linearly dependent even if some 
(but not all) of the coefficients in the linear combination are zero. 
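
The rank found in Example 4-10 may also be checked by machine. The following is a minimal sketch in Python, added here as an illustration rather than taken from the text: the function name `rank` is hypothetical, and the rank is obtained by row elimination in exact rational arithmetic instead of by evaluating determinants.

```python
from fractions import Fraction

def rank(matrix):
    """Rank by Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0  # number of pivot rows found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

A = [[1, 2, 3, 4],
     [1, 2, 1, 1],
     [2, 4, -6, -10]]

rank_A = rank(A)
# row 3 = 6 * row 2 - 4 * row 1, the dependence noted in the example
row_relation = [6 * b - 4 * a for a, b in zip(A[0], A[1])]
```

Elimination is used here only because it scales better than enumerating minors; for a matrix this small either route gives the same answer.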


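7. The determinant of a triangular matrix is equal to the product of the diagonal
elements. For example,

$$\begin{vmatrix} a & 0 \\ c & d \end{vmatrix} = \begin{vmatrix} a & b \\ 0 & d \end{vmatrix} = ad$$

A triangular matrix may be lower triangular, as in

$$A = \begin{bmatrix} a_{11} & 0 & 0 & \cdots & 0 \\ a_{21} & a_{22} & 0 & \cdots & 0 \\ a_{31} & a_{32} & a_{33} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} \end{bmatrix}$$

or upper triangular, as in

$$A^* = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22} & a_{23} & \cdots & a_{2n} \\ 0 & 0 & a_{33} & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & a_{nn} \end{bmatrix}$$

Expanding $|A|$ by the elements of the first row gives

$$|A| = a_{11} \begin{vmatrix} a_{22} & 0 & \cdots & 0 \\ a_{32} & a_{33} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ a_{n2} & a_{n3} & \cdots & a_{nn} \end{vmatrix}$$

Expanding the new determinant by its first row and repeating the process n times
gives

$$|A| = a_{11} a_{22} \cdots a_{nn}$$

Expanding $|A^*|$ successively by the first column similarly gives

$$|A^*| = a_{11} a_{22} \cdots a_{nn}$$

Two special cases of this result follow directly:

The determinant of a diagonal matrix is simply the product of the diagonal
elements:

$$|A| = \begin{vmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{vmatrix} = a_{11} a_{22} \cdots a_{nn}$$

The determinant of the unit or identity matrix is unity:

$$|I| = \begin{vmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{vmatrix} = 1$$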


8. Multiplying any row (column) of a matrix by a constant $\lambda$ multiplies the
determinant by $\lambda$. Multiplying every element in a matrix by $\lambda$ multiplies the
determinant by $\lambda^n$.


These properties follow directly from the definition of the determinant in Eq. 
(4-62), where it is seen that each term in the expansion is the product of n 
elements, one and only one from each row and column of the matrix. 


9. The determinant of the product of two square matrices is the product of the 
determinants. 


$$|AB| = |A| \cdot |B|$$


This rule is only of interest when A and B are both nonsingular. If either is 
singular, AB is singular and both sides of the equation are zero. If A is 
nonsingular, repeated applications of Property 5, that is, additions of multiples of 
rows and columns, can produce a diagonal matrix D, such that |D| = |A|. 


Example 4-11

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \quad \text{with } |A| = -2$$

Subtract 3 · row 1 from row 2 to get

$$\begin{bmatrix} 1 & 2 \\ 0 & -2 \end{bmatrix}$$

Then add row 2 to row 1 to get

$$D = \begin{bmatrix} 1 & 0 \\ 0 & -2 \end{bmatrix} \quad \text{with } |D| = -2 = |A|$$

If these steps are performed on the matrix AB, the result is a matrix DB with
|AB| = |DB| by property 5. This statement in general requires that only row
operations have been performed on A to obtain the diagonal matrix D. This
is always possible. The first step in the example is equivalent to premultiplying
A by

$$E_1 = \begin{bmatrix} 1 & 0 \\ -3 & 1 \end{bmatrix}$$

and the second step to a further premultiplication by

$$E_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

The sequence of operations is then described by premultiplication by a single
matrix

$$F = E_2 E_1 = \begin{bmatrix} -2 & 1 \\ -3 & 1 \end{bmatrix}$$

so that FA = D, as the reader may verify. By property 8,

$$|DB| = d_{11} d_{22} \cdots d_{nn} |B| = |D| \cdot |B| = |A| \cdot |B|$$

Thus

$$|AB| = |A| \cdot |B|$$
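
The verification left to the reader in Example 4-11 can be sketched in a few lines of Python. The matrix B below is an arbitrary illustrative choice that does not appear in the text, and the helper names `det2` and `matmul` are assumptions of this sketch.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

A = [[1, 2], [3, 4]]
F = [[-2, 1], [-3, 1]]   # combined row operations from the example
B = [[2, 0], [1, 5]]     # arbitrary second matrix, chosen only for illustration

D = matmul(F, A)                     # should reproduce the diagonal matrix D
det_of_product = det2(matmul(A, B))  # |AB|
product_of_dets = det2(A) * det2(B)  # |A| * |B|
```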


Properties of Inverse Matrices 
1. $(AB)^{-1} = B^{-1}A^{-1}$ provided A and B are each nonsingular.
The simplest proof is to multiply AB by the suggested inverse and see that the
unit matrix results, since we already know that the inverse matrix is unique:

$$ABB^{-1}A^{-1} = AIA^{-1} = I$$

and similarly,

$$B^{-1}A^{-1}AB = I$$

This technique is sometimes useful in deriving inverse matrices, namely, guess at a
plausible inverse and check by multiplication to see whether it works. The above
result extends readily to products of three or more matrices. Thus

$$(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$$

The warning must again be inserted that this result only holds when the
constituent matrices are nonsingular. Students occasionally produce “miraculous”
proofs by applying this theorem to rectangular matrices.
2. $(A^{-1})^{-1} = A$


that is, taking the inverse of the inverse reproduces the original matrix.
From the definition of an inverse,

$$A^{-1}(A^{-1})^{-1} = I$$

Premultiplying by A gives the required result.
3. $(A')^{-1} = (A^{-1})'$


that is, the inverse of the transpose equals the transpose of the inverse.
We have

$$AA^{-1} = I$$

Transposing,

$$(A^{-1})'A' = I$$

Postmultiplying by $(A')^{-1}$,

$$(A^{-1})'A'(A')^{-1} = (A')^{-1}$$

Thus

$$(A^{-1})' = (A')^{-1}$$


4. $|A^{-1}| = \dfrac{1}{|A|}$


that is, the determinant of $A^{-1}$ is the reciprocal of the determinant of A.


This follows directly from properties 7 and 9 of determinants, for

$$AA^{-1} = I$$

gives

$$|A| \cdot |A^{-1}| = 1$$
5. The inverse of an upper (lower) triangular matrix is also an upper (lower) 
triangular matrix. 
We merely illustrate this result for a lower triangular 3 x 3 matrix: 


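$$A = \begin{bmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

By inspection it is seen that three cofactors are zero, namely,

$$C_{21} = -\begin{vmatrix} 0 & 0 \\ a_{32} & a_{33} \end{vmatrix} \qquad C_{31} = \begin{vmatrix} 0 & 0 \\ a_{22} & 0 \end{vmatrix} \qquad C_{32} = -\begin{vmatrix} a_{11} & 0 \\ a_{21} & 0 \end{vmatrix}$$

Thus

$$A^{-1} = \frac{1}{|A|} \begin{bmatrix} C_{11} & 0 & 0 \\ C_{12} & C_{22} & 0 \\ C_{13} & C_{23} & C_{33} \end{bmatrix}$$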


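6. The inverse of a partitioned matrix: If

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

where $A_{11}$ and $A_{22}$ are square nonsingular matrices,

$$A^{-1} = \begin{bmatrix} B_{11} & -B_{11}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}B_{11} & A_{22}^{-1} + A_{22}^{-1}A_{21}B_{11}A_{12}A_{22}^{-1} \end{bmatrix} \qquad (4\text{-}65)$$

where $B_{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}$; or alternatively,

$$A^{-1} = \begin{bmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}B_{22}A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}B_{22} \\ -B_{22}A_{21}A_{11}^{-1} & B_{22} \end{bmatrix} \qquad (4\text{-}66)$$

where

$$B_{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}$$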
These formulas are frequently used. The first form, Eq. (4-65), is the simpler 
if we are interested in an expression that involves just the first row of the inverse. 


Conversely, Eq. (4-66) is the simpler for expressions involving the second row. 
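The derivation of the formulas is straightforward but tedious. Let

$$A^{-1} = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$

where the $B_{ij}$ submatrices have the same dimensions as the corresponding $A_{ij}$
submatrices. Postmultiplying A by $A^{-1}$ gives the matrix equations

$$A_{11}B_{11} + A_{12}B_{21} = I$$
$$A_{11}B_{12} + A_{12}B_{22} = 0$$
$$A_{21}B_{11} + A_{22}B_{21} = 0$$
$$A_{21}B_{12} + A_{22}B_{22} = I$$

where the unit matrix has been partitioned conformably with A. The third
equation in this set gives

$$B_{21} = -A_{22}^{-1}A_{21}B_{11} \qquad (4\text{-}67)$$

Substituting this in the first and solving for $B_{11}$ gives

$$B_{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1} \qquad (4\text{-}68)$$

A similar treatment of the second and fourth equations yields

$$B_{12} = -A_{11}^{-1}A_{12}B_{22} \qquad (4\text{-}69)$$

and

$$B_{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \qquad (4\text{-}70)$$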


These four expressions are seen to constitute, respectively, the first and second
columns in the two alternative formulations of $A^{-1}$.
To derive the remaining columns in Eqs. (4-65) and (4-66) we multiply out
$A^{-1}A = I$ to obtain

$$B_{11}A_{11} + B_{12}A_{21} = I$$
$$B_{11}A_{12} + B_{12}A_{22} = 0$$
$$B_{21}A_{11} + B_{22}A_{21} = 0$$
$$B_{21}A_{12} + B_{22}A_{22} = I$$

The second equation in this set gives

$$B_{12} = -B_{11}A_{12}A_{22}^{-1} \qquad (4\text{-}71)$$

Substituting this and Eq. (4-67) in the fourth equation of the first set above and
solving for $B_{22}$ gives

$$B_{22} = A_{22}^{-1} + A_{22}^{-1}A_{21}B_{11}A_{12}A_{22}^{-1} \qquad (4\text{-}72)$$

These two expressions complete the second column in the definition of $A^{-1}$ in Eq.
(4-65). The third equation of this set and the first equation of the previous set
yield

$$B_{21} = -B_{22}A_{21}A_{11}^{-1} \qquad (4\text{-}73)$$

and

$$B_{11} = A_{11}^{-1} + A_{11}^{-1}A_{12}B_{22}A_{21}A_{11}^{-1} \qquad (4\text{-}74)$$

which completes the first column in Eq. (4-66).
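
A small numerical check of Eq. (4-68) may help fix ideas. The sketch below is illustrative only: the 3 × 3 matrix is an arbitrary choice not found in the text, it is partitioned so that $A_{11}$ is 1 × 1 and $A_{22}$ is 2 × 2, and $B_{11}$ is compared with the top-left element of the inverse computed independently as $C_{11}/|A|$.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, a in enumerate(m[0]))

# Arbitrary nonsingular matrix, used only for illustration
A = [[Fraction(v) for v in row] for row in [[4, 1, 2], [1, 3, 0], [2, 0, 5]]]

a11 = A[0][0]                      # A11 (1 x 1)
A12 = A[0][1:]                     # A12 (1 x 2)
A21 = [A[1][0], A[2][0]]           # A21 (2 x 1)
A22 = [row[1:] for row in A[1:]]   # A22 (2 x 2)

d22 = det(A22)
A22_inv = [[A22[1][1] / d22, -A22[0][1] / d22],
           [-A22[1][0] / d22, A22[0][0] / d22]]

# B11 = (A11 - A12 A22^{-1} A21)^{-1}, a scalar for this partition
w = [sum(A22_inv[i][k] * A21[k] for k in range(2)) for i in range(2)]
B11 = 1 / (a11 - sum(A12[i] * w[i] for i in range(2)))

# Independent value of the (1,1) element of the inverse: C11 / |A|
top_left_of_inverse = det(A22) / det(A)
```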


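7. The inverse of a block diagonal matrix: Let A be

$$A = \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}$$

where A, $A_{11}$, and $A_{22}$ are all square matrices. If A is nonsingular, then so are
$A_{11}$ and $A_{22}$ since each has linearly independent columns (rows). Then

$$A^{-1} = \begin{bmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{bmatrix} \qquad (4\text{-}75)$$

This is merely a special case of property 6 or, alternatively, it may be seen
directly as the inverse since $AA^{-1}$ is clearly I. A special case of this result is the
inverse of a diagonal matrix. If

$$A = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{bmatrix} \quad \text{then} \quad A^{-1} = \begin{bmatrix} \dfrac{1}{a_{11}} & 0 & \cdots & 0 \\ 0 & \dfrac{1}{a_{22}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dfrac{1}{a_{nn}} \end{bmatrix}$$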


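8. The inverse of a Kronecker product: The direct or Kronecker product of two
matrices A and B is defined as

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{bmatrix} \qquad (4\text{-}76)$$

In this definition A is a general matrix of order $m \times n$, and likewise B can be
a rectangular matrix of any order, say, $p \times q$. In this case $A \otimes B$ is of order
$mp \times nq$. Suppose A is square of order m and nonsingular and that B is square of
order p and nonsingular. The Kronecker product $A \otimes B$ is square of order mp
and nonsingular. Its inverse is given by

$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \qquad (4\text{-}77)$$

The proof may be obtained by multiplying out. The right-hand side of Eq. (4-77)
is

$$\frac{1}{|A|} \begin{bmatrix} C_{11}B^{-1} & C_{21}B^{-1} & \cdots & C_{m1}B^{-1} \\ C_{12}B^{-1} & C_{22}B^{-1} & \cdots & C_{m2}B^{-1} \\ \vdots & \vdots & & \vdots \\ C_{1m}B^{-1} & C_{2m}B^{-1} & \cdots & C_{mm}B^{-1} \end{bmatrix}$$

where $C_{ij}$ denotes the cofactor of $a_{ij}$ in A. Multiplication by Eq. (4-76) yields the identity matrix.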
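
The multiplication in Eq. (4-77) is easily carried out by machine for small matrices. The following sketch is an illustration under stated assumptions: the two 2 × 2 matrices are arbitrary choices, and the helpers `kron`, `inv2`, and `matmul` are hypothetical names written for this example.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def kron(a, b):
    """Kronecker product: each a_ij is replaced by the block a_ij * B."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

A = [[Fraction(1), Fraction(2)], [Fraction(3), Fraction(4)]]
B = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(1)]]

# (A kron B) times (A^{-1} kron B^{-1}) should be the 4 x 4 identity
product = matmul(kron(A, B), kron(inv2(A), inv2(B)))
identity4 = [[int(i == j) for j in range(4)] for i in range(4)]
```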


9. Determinants of partitioned matrices: We sometimes need to express the
determinant of A in terms of the determinants of submatrices. We begin by noting
that

$$\begin{vmatrix} A_{11} & 0 \\ 0 & I \end{vmatrix} = |A_{11}| \qquad (4\text{-}78)$$

for if we evaluate the determinant on the left-hand side by expanding in terms of
the elements of the last row, the only nonzero term is the last one, which is unity
multiplied by a determinant of the same form, except that the order of I has been
reduced by 1. Proceeding in this way the result follows.


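A block diagonal matrix may be expressed as the product of two simpler
block diagonal matrices, namely,

$$A = \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & A_{22} \end{bmatrix}$$

Applying property 9 of determinants and also Eq. (4-78) gives

$$|A| = |A_{11}| \cdot |A_{22}| \qquad (4\text{-}79)$$

Now consider

$$\begin{vmatrix} A_{11} & A_{12} \\ 0 & I \end{vmatrix} = |A_{11}| \qquad (4\text{-}80)$$

This follows from the same argument used to establish Eq. (4-78). We can now
find the determinant of a block-triangular matrix.

$$A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & A_{22} \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ 0 & I \end{bmatrix}$$

Thus

$$|A| = |A_{11}| \cdot |A_{22}| \qquad (4\text{-}81)$$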
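This is a matrix generalization of property 7 of determinants, namely, that the
determinant of a triangular matrix is the product of the diagonal elements. The
final step is to establish the determinant of a general partitioned matrix

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

where $A_{11}$ and $A_{22}$ are square and nonsingular. Define

$$B_1 = \begin{bmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{bmatrix} \quad \text{and} \quad B_2 = \begin{bmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{bmatrix}$$

Then

$$B_1 A B_2 = \begin{bmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ 0 & A_{22} \end{bmatrix}$$

and since $|B_1| = |B_2| = 1$,

$$|A| = |A_{22}| \cdot |A_{11} - A_{12}A_{22}^{-1}A_{21}| \qquad (4\text{-}82)$$

An alternative expression may be derived in a similar fashion as

$$|A| = |A_{11}| \cdot |A_{22} - A_{21}A_{11}^{-1}A_{12}| \qquad (4\text{-}83)$$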
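
Equations (4-82) and (4-83) can be checked on a small case. In the sketch below, which is illustrative only, an arbitrary 3 × 3 matrix is partitioned with a 1 × 1 block $A_{11}$ and a 2 × 2 block $A_{22}$; the variable names are assumptions of this example.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, a in enumerate(m[0]))

A = [[Fraction(v) for v in row] for row in [[4, 1, 2], [1, 3, 0], [2, 0, 5]]]
a11, A12 = A[0][0], A[0][1:]
A21 = [A[1][0], A[2][0]]
A22 = [row[1:] for row in A[1:]]

# Eq. (4-83): |A| = |A11| * |A22 - A21 A11^{-1} A12|
reduced22 = [[A22[i][j] - A21[i] * A12[j] / a11 for j in range(2)] for i in range(2)]
rhs_483 = a11 * det(reduced22)

# Eq. (4-82): |A| = |A22| * |A11 - A12 A22^{-1} A21|; the second factor is a scalar here
d22 = det(A22)
adj22 = [[A22[1][1], -A22[0][1]], [-A22[1][0], A22[0][0]]]  # adjoint of A22
quad = sum(A12[i] * adj22[i][j] * A21[j] for i in range(2) for j in range(2)) / d22
rhs_482 = d22 * (a11 - quad)
```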


Cramer’s Rule 


This inordinately long section on the solution of equations may be rounded off by 
the derivation of Cramer’s rule for the solution of a set of n nonhomogeneous 
equations in n unknowns. The set of equations may be written 


Ax =b (4-84) 


where, by assumption, A is a known square matrix of order n and nonsingular, x
is a vector of n unknowns, and b is a known n-element vector. There is an
unfortunate clash of notation between conventions in algebra and conventions in
statistics. The normal equations for the least-squares vector are

$$(X'X)b = X'y$$

Here $(X'X)$ is a known matrix and $X'y$ a known vector, each depending on the
empirical data in a given problem, and b denotes a vector of unknown coefficients.
It is too late in the day to resolve this conflict: the student must maintain
sufficient intellectual agility to interpret the symbols according to the context.
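Returning to Eq. (4-84), the solution vector is written

$$x = A^{-1}b$$

Substitution for $A^{-1}$ from Eq. (4-64) gives

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \frac{1}{|A|} \begin{bmatrix} C_{11} & C_{21} & \cdots & C_{n1} \\ C_{12} & C_{22} & \cdots & C_{n2} \\ \vdots & \vdots & & \vdots \\ C_{1n} & C_{2n} & \cdots & C_{nn} \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

Thus

$$x_1 = \frac{1}{|A|} (b_1 C_{11} + b_2 C_{21} + \cdots + b_n C_{n1})$$

The expression in parentheses is seen to be the evaluation of the following
determinant by the elements in the first column:

$$b_1 C_{11} + b_2 C_{21} + \cdots + b_n C_{n1} = \begin{vmatrix} b_1 & a_{12} & a_{13} & \cdots & a_{1n} \\ b_2 & a_{22} & a_{23} & \cdots & a_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ b_n & a_{n2} & a_{n3} & \cdots & a_{nn} \end{vmatrix}$$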


Similar results hold for each element in x. The ith element is thus the ratio of two 
determinants, the denominator being the determinant of A and the numerator the 
determinant of the matrix obtained from A by replacing the ith column of A by b 
and leaving the other n — 1 columns unchanged. 


Example 4-12 Solve the system

$$2x_1 + 4x_2 - x_3 = 15$$
$$x_1 - 3x_2 + 2x_3 = -5$$
$$6x_1 + 5x_2 + x_3 = 28$$

This may be accomplished by several methods.


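(a) Cramer’s rule First calculate |A|; it is helpful to expand in
terms of the elements of a column, say, the first. Thus

$$|A| = \begin{vmatrix} 2 & 4 & -1 \\ 1 & -3 & 2 \\ 6 & 5 & 1 \end{vmatrix} = 2\begin{vmatrix} -3 & 2 \\ 5 & 1 \end{vmatrix} - \begin{vmatrix} 4 & -1 \\ 5 & 1 \end{vmatrix} + 6\begin{vmatrix} 4 & -1 \\ -3 & 2 \end{vmatrix} = 2(-13) - 9 + 6(5) = -5$$

Then

$$-5x_1 = \begin{vmatrix} 15 & 4 & -1 \\ -5 & -3 & 2 \\ 28 & 5 & 1 \end{vmatrix} = 15(-13) + 5(9) + 28(5) = -10$$

so that $x_1 = 2$,

$$-5x_2 = \begin{vmatrix} 2 & 15 & -1 \\ 1 & -5 & 2 \\ 6 & 28 & 1 \end{vmatrix} = -15\begin{vmatrix} 1 & 2 \\ 6 & 1 \end{vmatrix} - 5\begin{vmatrix} 2 & -1 \\ 6 & 1 \end{vmatrix} - 28\begin{vmatrix} 2 & -1 \\ 1 & 2 \end{vmatrix} = -15(-11) - 5(8) - 28(5) = -15$$

so that $x_2 = 3$, and

$$-5x_3 = \begin{vmatrix} 2 & 4 & 15 \\ 1 & -3 & -5 \\ 6 & 5 & 28 \end{vmatrix} = 15\begin{vmatrix} 1 & -3 \\ 6 & 5 \end{vmatrix} + 5\begin{vmatrix} 2 & 4 \\ 6 & 5 \end{vmatrix} + 28\begin{vmatrix} 2 & 4 \\ 1 & -3 \end{vmatrix} = 15(23) + 5(-14) + 28(-10) = -5$$

giving $x_3 = 1$.


Thus the solution vector is $x' = [2 \;\; 3 \;\; 1]$.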
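
The arithmetic of method (a) can be reproduced with a short program. The following is a minimal sketch, not the author's procedure: the function names `det` and `cramer` are assumptions, and the determinant is evaluated by cofactor expansion along the first row, which is adequate only for small systems.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, a in enumerate(m[0]))

def cramer(a, b):
    """Solve Ax = b by replacing column i of A with b, for each i in turn."""
    d = det(a)
    solution = []
    for i in range(len(a)):
        a_i = [row[:i] + [b_k] + row[i + 1:] for row, b_k in zip(a, b)]
        solution.append(Fraction(det(a_i), d))
    return solution

A = [[2, 4, -1], [1, -3, 2], [6, 5, 1]]
b = [15, -5, 28]
x = cramer(A, b)
```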


(b) Calculation of $A^{-1}$ From the calculations already completed in (a),

$$A^{-1} = -\frac{1}{5} \begin{bmatrix} -13 & -9 & 5 \\ 11 & 8 & -5 \\ 23 & 14 & -10 \end{bmatrix}$$

Thus

$$x = A^{-1}b = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}$$

Methods (a) and (b) are essentially slightly different ways of laying out the
same set of tedious calculations. A computationally much more efficient
method, not just for small systems but more especially for large systems, is
the elimination method.


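(c) Elimination method Lay out the system in matrix form as

$$\begin{bmatrix} 2 & 4 & -1 \\ 1 & -3 & 2 \\ 6 & 5 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 15 \\ -5 \\ 28 \end{bmatrix}$$

In the first step we produce zeros in the second and third positions of the first
column by subtracting one-half the first equation from the second and three
times the first from the third. This gives

$$\begin{bmatrix} 2 & 4 & -1 \\ 0 & -5 & 2.5 \\ 0 & -7 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 15 \\ -12.5 \\ -17 \end{bmatrix}$$

Next we produce a zero in the third position of the second column by
subtracting $\tfrac{7}{5}$ times the second equation from the third.

$$\begin{bmatrix} 2 & 4 & -1 \\ 0 & -5 & 2.5 \\ 0 & 0 & 0.5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 15 \\ -12.5 \\ 0.5 \end{bmatrix}$$

This gives an upper triangular system, which is solved for the x’s by back
substitution. The third equation gives directly

$$x_3 = 1$$

The second equation

$$-5x_2 + 2.5x_3 = -12.5 \quad \text{or} \quad -5x_2 = -15$$

then gives

$$x_2 = 3$$

and the first equation

$$2x_1 + 4x_2 - x_3 = 15 \quad \text{or} \quad 2x_1 = 4$$

gives

$$x_1 = 2$$

In the elimination method the inverse $A^{-1}$ is never calculated at all. The
calculations are fast and simple compared with the first two methods, but the
method does not shed light on the theoretical properties of the inverse.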
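
The elimination and back-substitution steps of method (c) translate directly into code. The sketch below is a simplified illustration: the function name `solve_by_elimination` is hypothetical, exact fractions are used so that the intermediate rows match the hand calculation, and nonzero pivots are assumed throughout, so no row interchanges are attempted.

```python
from fractions import Fraction

def solve_by_elimination(a, b):
    """Reduce [A | b] to upper triangular form, then back-substitute."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Assumes m[k][k] is nonzero; a general routine would interchange rows here.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            m[i] = [u - factor * v for u, v in zip(m[i], m[k])]
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return m, x

A = [[2, 4, -1], [1, -3, 2], [6, 5, 1]]
b = [15, -5, 28]
triangular, x = solve_by_elimination(A, b)
```

The returned `triangular` holds the augmented upper triangular system, so its rows can be compared with the intermediate displays in the example.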


4-5 THE EIGENVALUE PROBLEM 


The previous section was concerned with the solution of the set of equations

$$Ax = b \qquad (4\text{-}84)$$

This section is concerned with solutions of

$$Ax = \lambda x \qquad (4\text{-}85)$$

where A is a known square matrix of order n, x is an unknown n-element column
vector, and $\lambda$ is an unknown scalar. This problem will arise in a number of places
later in the book. It is known as the eigenvalue problem. In contrast with Eq.
(4-84) there are now two unknowns, a vector and a scalar. Solutions will come in
pairs; to each $\lambda$ there will correspond an x vector. The $\lambda$’s are known as
eigenvalues, latent roots, or characteristic roots and the x’s as eigenvectors, latent
vectors, or characteristic vectors.
For n = 2, Eq. (4-85), written out in full, becomes

$$(a_{11} - \lambda)x_1 + a_{12}x_2 = 0$$
$$a_{21}x_1 + (a_{22} - \lambda)x_2 = 0$$

which may be put back in matrix form as

$$(A - \lambda I)x = 0 \qquad (4\text{-}86)$$

Equation (4-86) is equivalent to Eq. (4-85) for any n. If the matrix $A - \lambda I$ is
nonsingular, the only solution to Eq. (4-86) is the trivial x = 0. Thus for a
nontrivial solution to exist, the matrix must be singular or, in other words, have a
zero determinant. This condition gives

$$|A - \lambda I| = 0 \qquad (4\text{-}87)$$

which is known as the characteristic equation for the matrix A. This gives a
polynomial equation in the unknown $\lambda$. Each root or eigenvalue $\lambda_i$ may be
substituted back into Eq. (4-86) and the corresponding eigenvector $x_i$ obtained.
For the 2 × 2 case it is easily seen that the characteristic equation is


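$$\lambda^2 - (a_{11} + a_{22})\lambda + (a_{11}a_{22} - a_{12}a_{21}) = 0 \qquad (4\text{-}88)$$

with roots

$$\lambda_1 = \tfrac{1}{2}\left[(a_{11} + a_{22}) + \sqrt{(a_{11} + a_{22})^2 - 4(a_{11}a_{22} - a_{12}a_{21})}\,\right]$$

$$\lambda_2 = \tfrac{1}{2}\left[(a_{11} + a_{22}) - \sqrt{(a_{11} + a_{22})^2 - 4(a_{11}a_{22} - a_{12}a_{21})}\,\right]$$

In the special case of a 2 × 2 symmetric matrix, $a_{12} = a_{21}$, the roots become

$$\lambda = \tfrac{1}{2}\left[(a_{11} + a_{22}) \pm \sqrt{(a_{11} - a_{22})^2 + 4a_{12}^2}\,\right]$$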


and since the content of the square root sign is the sum of two squares, the roots 
are necessarily real for a real symmetric matrix. Notice also that the characteristic 
equation may be written 


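$$(\lambda_1 - \lambda)(\lambda_2 - \lambda) = \lambda^2 - (\lambda_1 + \lambda_2)\lambda + \lambda_1\lambda_2 = 0$$

Comparison with Eq. (4-88) shows that

$$\text{Sum of roots} = \lambda_1 + \lambda_2 = a_{11} + a_{22} = \text{trace (sum of diagonal elements of } A) \qquad (4\text{-}89)$$

$$\text{Product of roots} = \lambda_1\lambda_2 = a_{11}a_{22} - a_{12}a_{21} = |A| \qquad (4\text{-}90)$$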


These two properties hold true in the general nth-order case, as does the previous 
result on real roots for a real symmetric matrix. 
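
The 2 × 2 formulas above are simple enough to code directly. The sketch below is illustrative: the function name `eigenvalues_2x2` and the two test matrices are assumptions of this example, and the complex square root is used so that a nonsymmetric matrix with complex roots does not raise an error.

```python
import cmath

def eigenvalues_2x2(a11, a12, a21, a22):
    """Roots of the characteristic equation (4-88) of a 2 x 2 matrix."""
    trace = a11 + a22
    determinant = a11 * a22 - a12 * a21
    root = cmath.sqrt(trace ** 2 - 4 * determinant)
    return (trace + root) / 2, (trace - root) / 2

# Symmetric case: the roots are real
lam1, lam2 = eigenvalues_2x2(3, 1, 1, 3)
sum_of_roots = lam1 + lam2        # compare with the trace, 6
product_of_roots = lam1 * lam2    # compare with the determinant, 8

# Nonsymmetric case: the roots may be complex
mu1, mu2 = eigenvalues_2x2(0, -1, 1, 0)
```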


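Example 4-13

$$A = \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix}$$

Thus

$$|A - \lambda I| = \begin{vmatrix} 4 - \lambda & 2 \\ 2 & 1 - \lambda \end{vmatrix}$$

and the characteristic equation is

$$\lambda^2 - 5\lambda = 0$$

with roots

$$\lambda_1 = 5 \quad \text{and} \quad \lambda_2 = 0$$

For $\lambda_1 = 5$, substitution in Eq. (4-86) gives

$$\begin{bmatrix} -1 & 2 \\ 2 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \;\Rightarrow\; x_1 = 2x_2$$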

Thus one element in the eigenvector is arbitrary, and so if x satisfies Eq. 
(4-86) for some A, then so does cx, where c is an arbitrary constant. It is 
conventional to normalize the vector by setting its length at unity, that is, 
making 


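$$x_1^2 + x_2^2 = 1$$

which, with $x_1 = 2x_2$, gives

$$\mathbf{x}_1 = \begin{bmatrix} 2/\sqrt{5} \\ 1/\sqrt{5} \end{bmatrix} \quad \text{corresponding to } \lambda_1 = 5$$

Similarly, it may be shown that

$$\mathbf{x}_2 = \begin{bmatrix} -1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}$$

is the eigenvector corresponding to $\lambda_2 = 0$.
It is seen that the eigenvectors are orthogonal, $\mathbf{x}_1'\mathbf{x}_2 = 0$. If we assemble
the eigenvectors in a matrix X,

$$X = [\mathbf{x}_1 \;\; \mathbf{x}_2] = \begin{bmatrix} 2/\sqrt{5} & -1/\sqrt{5} \\ 1/\sqrt{5} & 2/\sqrt{5} \end{bmatrix}$$

and then form $X'X$, we obtain the result that

$$X'X = XX' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad (4\text{-}91)$$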


We will derive this result for the general case below. Forming the matrix 
product X’AX gives 

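$$X'AX = \begin{bmatrix} 2/\sqrt{5} & 1/\sqrt{5} \\ -1/\sqrt{5} & 2/\sqrt{5} \end{bmatrix} \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 2/\sqrt{5} & -1/\sqrt{5} \\ 1/\sqrt{5} & 2/\sqrt{5} \end{bmatrix} = \begin{bmatrix} 5 & 0 \\ 0 & 0 \end{bmatrix} \qquad (4\text{-}92)$$
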
The diagonal matrix on the right-hand side of Eq. (4-92) displays the 
eigenvalues 5 and 0 on the main diagonal. 
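
The results of Example 4-13 can be confirmed numerically. The sketch below is an illustration under stated assumptions: the eigenvector for the zero root is taken with the sign convention $(-1/\sqrt{5},\, 2/\sqrt{5})$, the helper names are hypothetical, and comparisons use a small tolerance because the square roots are held in floating point.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def close(m, target, tol=1e-12):
    """Element-by-element comparison within a tolerance."""
    return all(abs(u - v) < tol for r, t in zip(m, target) for u, v in zip(r, t))

s = math.sqrt(5)
A = [[4, 2], [2, 1]]
X = [[2 / s, -1 / s],   # columns are the normalized eigenvectors
     [1 / s,  2 / s]]

XtX = matmul(transpose(X), X)                # should be the identity
XtAX = matmul(matmul(transpose(X), A), X)    # should be diag(5, 0)
```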


Properties of Eigenvectors and Eigenvalues of a Real Symmetric Matrix of 
Order n 

In statistical applications we are mainly concerned with symmetric real matrices.
Properties with an asterisk apply specifically to real symmetric matrices; those
without an asterisk apply to real nonsymmetric as well as to symmetric matrices.


1.* The eigenvalues are real.


Suppose we have a complex eigenvalue $\lambda + i\mu$, where $i$ denotes $\sqrt{-1}$, and a
corresponding complex eigenvector $x + iy$. Then

$$A(x + iy) = (\lambda + i\mu)(x + iy)$$

Multiplying out and equating real and imaginary parts gives

$$Ax = \lambda x - \mu y$$

and

$$Ay = \mu x + \lambda y$$

Premultiplying the first equation by $y'$ and the second by $x'$ gives

$$y'Ax = \lambda x'y - \mu y'y$$

$$x'Ay = \mu x'x + \lambda x'y$$

When A is symmetric, $y'Ax = x'Ay$ (a scalar equals its transpose). Subtracting the
first equation from the second then yields

$$0 = \mu(x'x + y'y)$$

Since the eigenvectors must be nontrivial, $x'x > 0$ or $y'y > 0$ (or both), so

$$\mu = 0$$

that is, there cannot be a complex eigenvalue. Real eigenvalues in turn generate
real eigenvectors, that is, y = 0.


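2.* Eigenvectors corresponding to distinct eigenvalues are pairwise orthogonal.


If $x_1, x_2$ denote the eigenvectors corresponding to $\lambda_1, \lambda_2$, then

$$Ax_1 = \lambda_1 x_1 \;\Rightarrow\; x_2'Ax_1 = \lambda_1 x_2'x_1$$

and

$$Ax_2 = \lambda_2 x_2 \;\Rightarrow\; x_1'Ax_2 = \lambda_2 x_1'x_2$$

The symmetry of A gives

$$x_2'Ax_1 = x_1'Ax_2$$

Thus

$$\lambda_1 x_2'x_1 = \lambda_2 x_1'x_2$$

If $\lambda_1 \neq \lambda_2$ this last equation gives

$$x_1'x_2 = 0$$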


3.* If an eigenvalue $\lambda$ has multiplicity k (that is, is repeated k times), there will be
k orthogonal vectors corresponding to this root.†


As an illustration of this result consider the diagonal matrix

$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

The characteristic equation is

$$(1 - \lambda)^2 (2 - \lambda) = 0$$

with roots

$$\lambda_1 = 1 \text{ with multiplicity 2}$$
$$\lambda_2 = 2$$

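† For a proof, see G. Hadley, Linear Algebra, Addison Wesley, Reading, MA, 1961, pp. 243-245.

For $\lambda_2 = 2$, $(A - \lambda I)x = 0$ gives

$$\begin{bmatrix} -1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = 0 \;\Rightarrow\; x = x_2 \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$

The multiple root gives

$$\begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = 0 \;\Rightarrow\; x = x_1 \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + x_3 \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

The root with multiplicity 2 thus yields two orthogonal eigenvectors $e_1$ and $e_3$.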


4.* The nth-order symmetric matrix A has eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$, possibly not
all distinct.† Properties 2 and 3 then guarantee a set of n orthogonal eigenvectors
$x_1, x_2, \ldots, x_n$ such that

$$x_i'x_j = 0 \qquad i \neq j;\; i, j = 1, 2, \ldots, n \qquad (4\text{-}93)$$

As we have seen, any eigenvector is arbitrary up to a scale factor,

$$Ax_i = \lambda_i x_i \;\Rightarrow\; A(cx_i) = \lambda_i (cx_i)$$

where c is any constant. The arbitrariness may be removed by normalizing the x
vectors, and the most common normalization is to set the length of each vector at
unity, that is,

$$x_i'x_i = 1 \qquad i = 1, 2, \ldots, n \qquad (4\text{-}94)$$

Conditions (4-93) and (4-94) define an orthonormal set of vectors. The conditions
may be combined in a single statement,

$$x_i'x_j = \delta_{ij} = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases} \qquad (4\text{-}95)$$

where $\delta_{ij}$ is known as the Kronecker delta. Define X to be an nth-order matrix
whose columns are the vectors $x_1, x_2, \ldots, x_n$. Condition (4-95) may then be
written in the alternative form

$$X'X = I \qquad (4\text{-}96)$$

From the definition and uniqueness of the inverse matrix it then follows that

$$X' = X^{-1} \qquad (4\text{-}97)$$

The matrix X is then said to be an orthogonal matrix, that is, a matrix such that its
inverse is simply its transpose. It would be more appropriate to call it an
orthonormal matrix, since Eq. (4-96) requires all the columns to have unit length
as well as being orthogonal, but the former designation is the one established in
the literature. A remarkable property of orthogonal matrices follows immediately
from Eq. (4-97). Since the inverse is unique,

$$XX' = I \qquad (4\text{-}98)$$


† See G. Hadley, op. cit., p. 245.
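that is, although X was constructed as a matrix with orthogonal columns, its row
vectors are also orthogonal. Thus an orthogonal matrix is defined by

$$X'X = XX' = I \qquad (4\text{-}99)$$


Example 4-14

$$A = \begin{bmatrix} 1 & 2 & 0 \\ 2 & 2 & \sqrt{2} \\ 0 & \sqrt{2} & 1 \end{bmatrix}$$

The characteristic equation is then

$$\begin{vmatrix} 1 - \lambda & 2 & 0 \\ 2 & 2 - \lambda & \sqrt{2} \\ 0 & \sqrt{2} & 1 - \lambda \end{vmatrix} = 0$$

that is,

$$(1 - \lambda)(1 + \lambda)(-4 + \lambda) = 0$$

with roots

$$\lambda_1 = 1 \qquad \lambda_2 = -1 \qquad \lambda_3 = 4$$

$$\lambda_1 = 1: \quad (A - I)x = \begin{bmatrix} 0 & 2 & 0 \\ 2 & 1 & \sqrt{2} \\ 0 & \sqrt{2} & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \;\Rightarrow\; \mathbf{x}_1 = \begin{bmatrix} 1/\sqrt{3} \\ 0 \\ -\sqrt{2}/\sqrt{3} \end{bmatrix}$$

$$\lambda_2 = -1: \quad (A + I)x = \begin{bmatrix} 2 & 2 & 0 \\ 2 & 3 & \sqrt{2} \\ 0 & \sqrt{2} & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \;\Rightarrow\; \mathbf{x}_2 = \begin{bmatrix} 2/\sqrt{10} \\ -2/\sqrt{10} \\ 1/\sqrt{5} \end{bmatrix}$$

$$\lambda_3 = 4: \quad (A - 4I)x = \begin{bmatrix} -3 & 2 & 0 \\ 2 & -2 & \sqrt{2} \\ 0 & \sqrt{2} & -3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \;\Rightarrow\; \mathbf{x}_3 = \begin{bmatrix} 2/\sqrt{15} \\ 3/\sqrt{15} \\ \sqrt{2}/\sqrt{15} \end{bmatrix}$$

Thus

$$X = \begin{bmatrix} 1/\sqrt{3} & 2/\sqrt{10} & 2/\sqrt{15} \\ 0 & -2/\sqrt{10} & 3/\sqrt{15} \\ -\sqrt{2}/\sqrt{3} & 1/\sqrt{5} & \sqrt{2}/\sqrt{15} \end{bmatrix}$$

The reader can check numerically that the rows of X all have unit length and
are pairwise orthogonal (as, of course, are the columns).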
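
The numerical check suggested for Example 4-14 may be sketched as follows. This is an illustration under stated assumptions: each eigenvector is determined only up to sign, so the columns of X below represent one admissible sign convention, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def close(m, target, tol=1e-12):
    return all(abs(u - v) < tol for r, t in zip(m, target) for u, v in zip(r, t))

r2, r3, r5, r10, r15 = (math.sqrt(v) for v in (2, 3, 5, 10, 15))

A = [[1, 2, 0],
     [2, 2, r2],
     [0, r2, 1]]

# Columns: normalized eigenvectors for the roots 1, -1, and 4 (signs are a convention)
X = [[1 / r3,    2 / r10, 2 / r15],
     [0.0,      -2 / r10, 3 / r15],
     [-r2 / r3,  1 / r5,  r2 / r15]]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
columns_orthonormal = close(matmul(transpose(X), X), identity)
rows_orthonormal = close(matmul(X, transpose(X)), identity)
XtAX = matmul(matmul(transpose(X), A), X)   # should be diag(1, -1, 4)
```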


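5.* The orthogonal matrix of eigenvectors diagonalizes A, that is,

$$X'AX = \Lambda \qquad (4\text{-}100)$$

where $\Lambda = \operatorname{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$.


For $\lambda_j$ and $x_j$ we have

$$Ax_j = \lambda_j x_j$$

Premultiplying by $x_i'$,

$$x_i'Ax_j = \lambda_j x_i'x_j = \lambda_j \delta_{ij} \quad \text{using Eq. (4-95)} \qquad (4\text{-}101)$$

Equation (4-101) displays the i, jth element in $X'AX$, and collecting for all i, j
gives Eq. (4-100). An alternative proof illustrates a useful exercise in matrix
manipulation,

$$AX = A[x_1 \;\; x_2 \;\; \cdots \;\; x_n] = [Ax_1 \;\; Ax_2 \;\; \cdots \;\; Ax_n] = [\lambda_1 x_1 \;\; \lambda_2 x_2 \;\; \cdots \;\; \lambda_n x_n]$$

$$= [x_1 \;\; x_2 \;\; \cdots \;\; x_n] \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix} = X\Lambda$$

Premultiplying by $X'$ then gives Eq. (4-100). We should not conclude from this
result that only symmetric matrices can be diagonalized. If for any matrix A there
are n linearly independent eigenvectors and we arrange them as the columns of a
matrix X, then

$$X^{-1}AX = \Lambda \qquad (4\text{-}102)$$

The contrast with Eq. (4-100) is that the columns of X are not necessarily of unit
length, nor are they necessarily orthogonal.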


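6. The sum of the eigenvalues is equal to the sum of the diagonal elements (trace)
of A.
This property is true for any matrix, but the proof is particularly simple for
symmetric matrices. Denote the trace of a (square) matrix A by

$$\operatorname{tr}(A) = a_{11} + a_{22} + \cdots + a_{nn}$$

For two matrices, A of order $m \times n$ and B of order $n \times m$,

$$\operatorname{tr}(AB) = \operatorname{tr}(BA) \qquad (4\text{-}103)$$

AB is of order $m \times m$. Its ith diagonal element is

$$\sum_{j=1}^{n} a_{ij} b_{ji}$$

Thus

$$\operatorname{tr}(AB) = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} b_{ji}$$

BA is of order $n \times n$. Its jth diagonal element is

$$\sum_{i=1}^{m} b_{ji} a_{ij}$$

Thus

$$\operatorname{tr}(BA) = \sum_{j=1}^{n} \sum_{i=1}^{m} b_{ji} a_{ij} = \operatorname{tr}(AB)$$

This result extends simply to

$$\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \qquad (4\text{-}104)$$

Turning now to

$$X'AX = \Lambda$$

$$\operatorname{tr}\Lambda = \operatorname{tr}(X'AX) = \operatorname{tr}(AXX') \quad \text{using Eq. (4-104)}$$

Thus

$$\operatorname{tr}\Lambda = \operatorname{tr}(A) \qquad (4\text{-}105)$$

or

$$\lambda_1 + \lambda_2 + \cdots + \lambda_n = a_{11} + a_{22} + \cdots + a_{nn}$$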
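
Equation (4-103) holds even when AB and BA are of different orders, which a small example makes concrete. The matrices below are arbitrary illustrative choices, and the helper names `matmul` and `trace` are assumptions of this sketch.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[1, 0],
     [2, -1],
     [0, 3]]         # 3 x 2

AB = matmul(A, B)    # 2 x 2
BA = matmul(B, A)    # 3 x 3
```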


7. The product of the eigenvalues is equal to the determinant of A.


This result is again true for any matrix, but the proof is very simple for
symmetric matrices. We note first that when X is an orthogonal matrix,

$$|X| = \pm 1 \qquad (4\text{-}106)$$

for

$$X'X = I \;\Rightarrow\; |X'| \cdot |X| = 1$$

but

$$|X| = |X'|$$

Thus

$$|X| = \pm 1$$

Returning again to

$$X'AX = \Lambda$$

$$|X'| \cdot |A| \cdot |X| = |\Lambda|$$

Thus

$$|A| = \lambda_1 \lambda_2 \cdots \lambda_n \qquad (4\text{-}107)$$
8. The rank of A is equal to the number of nonzero eigenvalues.


We established in Eq. (4-53) that pre- or postmultiplication of any matrix by
nonsingular matrices does not change its rank. Thus, again from Eq. (4-102),

$$\rho(A) = \rho(\Lambda) \qquad (4\text{-}108)$$

and the easiest way to establish the rank of $\Lambda$ is to determine the order of the
largest nonvanishing determinant that can be formed from its elements. This is
simply equal to the number of nonvanishing eigenvalues.


9. The eigenvalues of $A^2$ are the squares of the eigenvalues of A, but the
eigenvectors of both matrices are the same.

$$Ax = \lambda x$$

Premultiplying by A,

$$A^2 x = \lambda Ax = \lambda^2 x$$

which establishes the result. We may note, in passing, a very useful application of
this result in analyzing the stability of dynamic systems. Suppose $y_t$ denotes a
vector of the values taken by a number of economic variables in time period t,
and suppose $y_t$ can be expressed in terms of the previous values by the system of
equations

$$y_t = Ay_{t-1} \qquad (4\text{-}109)$$

Even if the original specification of the system involves lags of more than one
period, an appropriate definition of new variables can produce a derived system
of the type of Eq. (4-109).† Successive substitution in Eq. (4-109) gives

$$y_t = A^t y_0$$

where $y_0$ denotes initial values of the variables. Provided A has a linearly
independent set of eigenvectors,

$$X^{-1}AX = \Lambda$$

or

$$A = X\Lambda X^{-1}$$

Thus

$$A^2 = X\Lambda X^{-1} X\Lambda X^{-1} = X\Lambda^2 X^{-1}$$

So

$$A^t = X\Lambda^t X^{-1}$$

and the elements of $y_t$ are seen to be linear combinations of the tth powers of the
eigenvalues of A. Thus if the system is to be stable, we need

$$|\lambda_i| < 1 \qquad i = 1, 2, \ldots, n$$
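
The stability condition can be illustrated by iterating Eq. (4-109) directly. The two coefficient matrices below are hypothetical examples chosen for this sketch: the first has eigenvalues 0.6 and 0.3, the second has eigenvalues 1.1 and 0.5, and the function name `simulate` is likewise an assumption.

```python
def simulate(a, y0, periods):
    """Iterate y_t = A y_{t-1} for the given number of periods."""
    y = list(y0)
    for _ in range(periods):
        y = [sum(a_ij * y_j for a_ij, y_j in zip(row, y)) for row in a]
    return y

stable_A = [[0.5, 0.2],
            [0.1, 0.4]]     # eigenvalues 0.6 and 0.3, both inside the unit circle
unstable_A = [[1.1, 0.0],
              [0.0, 0.5]]   # one eigenvalue exceeds unity in absolute value

y_stable = simulate(stable_A, [1.0, 1.0], 50)
y_unstable = simulate(unstable_A, [1.0, 1.0], 50)
```

After 50 periods the stable system has decayed essentially to zero, while the first element of the unstable system has grown by a factor of $1.1^{50}$.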


† See G. Chow, Analysis and Control of Dynamic Economic Systems, Wiley, New York, 1975, pp.
21-35.


10. The eigenvalues of $A^{-1}$ are the reciprocals of the eigenvalues of A, but the
eigenvectors of both matrices are the same.

$$Ax = \lambda x$$

Premultiply by $A^{-1}$,

$$x = \lambda A^{-1}x$$

or

$$A^{-1}x = \frac{1}{\lambda}\, x$$

which establishes the result.
11. Each eigenvalue of an idempotent matrix is either zero or unity.


By property 9

$$A^2 x = \lambda^2 x$$

But when A is idempotent,

$$A^2 x = Ax = \lambda x$$

Thus

$$\lambda(\lambda - 1)x = 0$$

and since any eigenvector x is not the null vector,

$$\lambda = 0 \quad \text{or} \quad \lambda = 1$$


12. The rank of an idempotent matrix is equal to its trace. 


This follows from properties 6, 8, and 11, 


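$$\rho(A) = \rho(\Lambda) \quad \text{from property 8}$$
$$= \text{number of nonzero eigenvalues}$$
$$= \operatorname{tr}(\Lambda) \quad \text{from property 11}$$
$$= \operatorname{tr}(A) \quad \text{from property 6}$$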
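
Property 12 may be illustrated with a simple idempotent matrix. The matrix used below, $M = I - (1/n)\,ii'$ with $i$ a column of ones, is an assumption of this sketch rather than an example taken from the text at this point, and the helper names are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rank(matrix):
    """Rank by Gaussian elimination (entries are assumed to be exact Fractions)."""
    rows = [list(row) for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [u - factor * v for u, v in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

n = 4
# M = I - (1/n) i i': diagonal elements 1 - 1/n, off-diagonal elements -1/n
M = [[Fraction(int(i == j)) - Fraction(1, n) for j in range(n)] for i in range(n)]

is_idempotent = matmul(M, M) == M
trace_M = sum(M[i][i] for i in range(n))
rank_M = rank(M)
```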


4-6 QUADRATIC FORMS AND POSITIVE DEFINITE MATRICES 


We have already introduced quadratic forms briefly in Sec. 4-2 and have seen that
there is no loss of generality in considering only symmetric matrices. For a 2 × 2
symmetric matrix A and a two-element column vector x, the quadratic form is

$$x'Ax = a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2$$

For a third-order matrix

$$x'Ax = a_{11}x_1^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + a_{22}x_2^2 + 2a_{23}x_2x_3 + a_{33}x_3^2$$

For the general nth-order case

$$\begin{aligned} x'Ax = {} & a_{11}x_1^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + \cdots + 2a_{1n}x_1x_n \\ & + a_{22}x_2^2 + 2a_{23}x_2x_3 + \cdots + 2a_{2n}x_2x_n \\ & + a_{33}x_3^2 + \cdots + 2a_{3n}x_3x_n \\ & + \cdots + a_{nn}x_n^2 \end{aligned}$$

Definitions 


If $x'Ax > 0$ for all $x \neq 0$, the quadratic form is said to be positive definite and A is
said to be a positive definite matrix.

If $x'Ax \geq 0$ for all $x \neq 0$, the form and matrix are positive semidefinite.

Reversing the above inequality signs defines negative definite and negative 
semidefinite matrices, respectively. If a form is positive for some x vectors and 
negative for others, it is said to be indefinite. 

It is important to have tests for positive definite matrices. 


1. A necessary and sufficient condition for the real symmetric matrix A to be
positive definite is that all the eigenvalues of A be positive.


To prove the necessary condition assume $x'Ax > 0$. For any eigenvalue $\lambda_i$,

$$Ax_i = \lambda_i x_i$$

Premultiplying by $x_i'$ gives

$$x_i'Ax_i = \lambda_i x_i'x_i = \lambda_i$$

Since $x'Ax > 0$ holds for any $x \neq 0$, it holds for each eigenvector, and so $\lambda_i > 0$
for all i. To prove sufficiency we assume all $\lambda_i > 0$ and show that $x'Ax > 0$. Since
a symmetric matrix has a full set of n orthogonal eigenvectors $x_1, x_2, \ldots, x_n$, any
nonnull vector x may be expressed as a linear combination of the eigenvectors

$$x = c_1x_1 + c_2x_2 + \cdots + c_nx_n$$

Thus

$$Ax = c_1Ax_1 + c_2Ax_2 + \cdots + c_nAx_n = c_1\lambda_1x_1 + c_2\lambda_2x_2 + \cdots + c_n\lambda_nx_n$$

$$x'Ax = (c_1x_1 + c_2x_2 + \cdots + c_nx_n)'(c_1\lambda_1x_1 + c_2\lambda_2x_2 + \cdots + c_n\lambda_nx_n) = c_1^2\lambda_1 + c_2^2\lambda_2 + \cdots + c_n^2\lambda_n$$

since

$$x_i'x_j = \delta_{ij} = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases} \qquad i, j = 1, 2, \ldots, n$$

Since all $\lambda_i$ are assumed to be positive, $x'Ax > 0$.


2. A necessary and sufficient condition for a real symmetric matrix A to be positive 
definite is that the determinant of every principal submatrix be positive. 
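The principal submatrices of A are a set of n submatrices such as

$$a_{ii}, \quad \begin{bmatrix} a_{ii} & a_{ij} \\ a_{ji} & a_{jj} \end{bmatrix}, \quad \begin{bmatrix} a_{ii} & a_{ij} & a_{ik} \\ a_{ji} & a_{jj} & a_{jk} \\ a_{ki} & a_{kj} & a_{kk} \end{bmatrix}, \quad \ldots, \quad A$$

More conventionally one takes the upper submatrices

$$A_1 = [a_{11}] \qquad A_2 = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \qquad A_3 = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \qquad \cdots$$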

When A is positive definite, $x'Ax > 0$ for any nonzero x. Thus we may consider
an x vector whose first r elements are nonzero and whose last n − r elements are
zero, that is,

$$x' = [x_r' \;\; 0']$$

Then

$$x'Ax = [x_r' \;\; 0'] \begin{bmatrix} A_r & * \\ * & * \end{bmatrix} \begin{bmatrix} x_r \\ 0 \end{bmatrix} = x_r'A_rx_r$$

where A has been partitioned by the first r and the last n − r rows and columns
and the asterisks denote the remaining submatrices in A, which get wiped out by
the zero subvector in x. Since

$$x'Ax > 0$$

it follows that

$$x_r'A_rx_r > 0$$

Thus by the previous condition all the roots of $A_r$ are positive, and so

$$|A_r| > 0$$

A suitable choice of x vectors then gives the necessary and sufficient condition for
A to be positive definite as

$$|A_1| > 0, \; |A_2| > 0, \; |A_3| > 0, \; \ldots, \; |A| > 0 \qquad (4\text{-}110)$$
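
Condition (4-110) gives a direct computational test. The following sketch assumes that the input matrix is symmetric and uses a hypothetical function name, `is_positive_definite`; the determinants are evaluated by cofactor expansion, which is practical only for small orders.

```python
def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, a in enumerate(m[0]))

def is_positive_definite(a):
    """Check |A_1| > 0, |A_2| > 0, ..., |A| > 0 for a symmetric matrix a."""
    return all(det([row[:r] for row in a[:r]]) > 0 for r in range(1, len(a) + 1))

pd_example = [[2, -1, 0],
              [-1, 2, -1],
              [0, -1, 2]]     # upper principal determinants 2, 3, 4
semidefinite_example = [[4, 2],
                        [2, 1]]   # the matrix of Example 4-13, eigenvalues 5 and 0
indefinite_example = [[1, 2],
                      [2, 1]]     # second determinant is -3
```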
Finally we state a number of useful theorems on positive definite matrices. 


1. If A is symmetric and positive definite, a nonsingular matrix P can be found 
such that 


A = PP’ (4-111) 


We know that the matrix of eigenvectors of A can be used to diagonalize A,
that is, from Eq. (4-100),

$$X'AX = \Lambda$$

which gives

$$A = X\Lambda X' \qquad (4\text{-}112)$$

When A is positive definite, all its eigenvalues are positive. Thus $\Lambda$ may be
factored into

$$\Lambda = \Lambda^{1/2}\Lambda^{1/2}$$

where

$$\Lambda^{1/2} = \begin{bmatrix} \sqrt{\lambda_1} & & & \\ & \sqrt{\lambda_2} & & \\ & & \ddots & \\ & & & \sqrt{\lambda_n} \end{bmatrix}$$

Substitution in Eq. (4-112) gives

$$A = X\Lambda^{1/2}\Lambda^{1/2}X' = (X\Lambda^{1/2})(X\Lambda^{1/2})'$$

which gives Eq. (4-111) with

$$P = X\Lambda^{1/2}$$

and P is nonsingular since it is the product of nonsingular matrices.
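
The construction of P can be followed on a small case. The matrix below, with eigenvalues 3 and 1 and orthonormal eigenvectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$, is an illustrative choice not taken from the text, and the helper names are assumptions of this sketch.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

A = [[2, 1],
     [1, 2]]                       # positive definite, eigenvalues 3 and 1

h = 1 / math.sqrt(2)
X = [[h, h],
     [h, -h]]                      # orthonormal eigenvectors as columns
sqrt_lambda = [[math.sqrt(3), 0.0],
               [0.0, 1.0]]         # Lambda^{1/2}

P = matmul(X, sqrt_lambda)         # P = X Lambda^{1/2}
PPt = matmul(P, transpose(P))      # should reproduce A up to rounding
max_error = max(abs(PPt[i][j] - A[i][j]) for i in range(2) for j in range(2))
```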


2. If A is $n \times n$ and positive definite and if P is $n \times m$ with $\rho(P) = m$, then $P'AP$
is positive definite.


Clearly, $P'AP$ is an $m \times m$ symmetric matrix, and for any m-element vector y,

$$y'(P'AP)y = x'Ax$$

where x = Py. Thus x is seen to be a linear combination of the m linearly
independent columns of P, and so x = 0 if and only if y = 0. Thus $P'AP$ is
positive definite.
The final three results are stated without proof. 


3. If A is $n \times m$ with rank m < n, then $A'A$ is positive definite and $AA'$ is positive
semidefinite.
4. If A is $n \times m$ with rank k < min(m, n), then $A'A$ and $AA'$ are each positive
semidefinite.
5. If A and B are positive definite matrices and A —B is also positive definite, 


then B~' — A~' is positive definite.t 


4-7 MAXIMUM AND MINIMUM VALUES 


its on maxima and minima in matrix 


It is convenient to express the main resu i 
ned as a function of n independent 


terms. Consider a scalar variable y defi 
variables, 
¥ = (15 Xa9-+0 Xn) 


+A proof of this result is given in P. Dhrymes, Introductory Econometrics, Springer-Verlag, New 


York, 1978, App. A2.13. 
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which may also be written 
y =f) 
The first-order or total differential of the function is defined as

$$dy = f_1\,dx_1 + f_2\,dx_2 + \cdots + f_n\,dx_n \tag{4-113}$$

where

$$f_i = \frac{\partial y}{\partial x_i} \qquad i = 1, 2, \ldots, n$$

and the $dx_i$ indicate arbitrary changes in the $x_i$. For small $dx_i$ the first-order
differential gives the approximate value of the resultant change in y. Denoting the
vector of partial derivatives by f and the vector of differentials by dx,

$$f = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix} \qquad dx = \begin{bmatrix} dx_1 \\ dx_2 \\ \vdots \\ dx_n \end{bmatrix}$$

the first-order differential of y is simply

$$dy = f'\,dx \tag{4-114}$$

If y has a stationary value at a point

$$x^{*\prime} = [x_1^* \quad x_2^* \quad \cdots \quad x_n^*]$$

then dy = 0 for all points in the neighborhood of x*. For such points dx ≠ 0, and
so from Eq. (4-114) the necessary condition for a stationary value is

$$f = 0$$

that is, all partial derivatives are zero at the stationary point.

A stationary point may be a maximum, where the value of the function is less
at all points in the neighborhood of x*; a minimum, where the value of the
function is greater at all points in the neighborhood of x*; or a saddle point,
where the value of the function increases in some directions from x* and
diminishes in others. One may distinguish between these possibilities by means of
the second-order differential d²y. The second-order differential may be found by
totally differentiating the first-order differential. It is an approximation to the
change in dy as we move away from the point x*. Clearly, for a maximum value
dy will decrease from zero to some negative value, so d²y will be negative, and
conversely for a minimum value d²y will be positive. For a saddle point d²y will
be positive for some dx and negative for other dx.‡

‡ It is possible, but extremely rare, to have d²y = 0 for some dx. Such complexities are ignored
here.

Totally differentiating Eq. (4-113) gives

$$\begin{aligned}
d^2y &= \frac{\partial}{\partial x_1}\left[f_1\,dx_1 + f_2\,dx_2 + \cdots + f_n\,dx_n\right]dx_1 \\
&\quad + \frac{\partial}{\partial x_2}\left[f_1\,dx_1 + f_2\,dx_2 + \cdots + f_n\,dx_n\right]dx_2 \\
&\quad + \cdots + \frac{\partial}{\partial x_n}\left[f_1\,dx_1 + f_2\,dx_2 + \cdots + f_n\,dx_n\right]dx_n \\
&= f_{11}\,dx_1^2 + 2f_{12}\,dx_1\,dx_2 + 2f_{13}\,dx_1\,dx_3 + \cdots + 2f_{1n}\,dx_1\,dx_n \\
&\quad + f_{22}\,dx_2^2 + 2f_{23}\,dx_2\,dx_3 + \cdots + 2f_{2n}\,dx_2\,dx_n \\
&\quad + \cdots + f_{nn}\,dx_n^2
\end{aligned}$$

where

$$f_{ij} = \frac{\partial^2 y}{\partial x_i\,\partial x_j} = f_{ji} \qquad \text{for } i \neq j, \text{ all } i, j$$


and $dx_i^2$ indicates the square of the differential $dx_i$. The second-order differential
is thus seen to be a quadratic form in dx. The matrix of the quadratic form is the
symmetric Hessian matrix of second-order partial derivatives, which we will
denote by

$$F = \frac{\partial^2 y}{\partial x^2} = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1n} \\ f_{12} & f_{22} & \cdots & f_{2n} \\ \vdots & \vdots & & \vdots \\ f_{1n} & f_{2n} & \cdots & f_{nn} \end{bmatrix}$$

and we may write

$$d^2y = dx'\,F\,dx \tag{4-115}$$

Thus d²y is positive or negative as F is positive definite or negative definite. To
summarize, the conditions for a maximum or a minimum at a point x* are as
follows:

            First-order condition      Second-order condition
Maximum     f = ∂y/∂x = 0              F = ∂²y/∂x² is negative definite
Minimum     f = ∂y/∂x = 0              F = ∂²y/∂x² is positive definite

where f and F are evaluated at x*.


Constrained Extrema


In finding stationary values of $y = f(x_1, x_2, \ldots, x_n)$ the x's were assumed to be
independent variables. Thus we could specify n arbitrary differentials
$dx_1, dx_2, \ldots, dx_n$. In some problems, however, the x's may be subject to one or
more constraints, and we have to find a maximum or minimum value of y subject
to the constraints. We will assume for the moment that the function has a single
maximum or minimum value and state the problem formally as follows:

Find the x* vector which maximizes (minimizes) y subject to the m < n
constraints

$$g_j(x) = 0 \qquad j = 1, 2, \ldots, m, \quad m < n$$

Define the column vector

$$g(x) = \begin{bmatrix} g_1(x) \\ g_2(x) \\ \vdots \\ g_m(x) \end{bmatrix}$$

and a column vector of m Lagrange multipliers,

$$\lambda = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{bmatrix}$$

Using these we define a new objective function as

$$\phi = f(x) - \lambda' g(x) \tag{4-116}$$


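Thus φ is a scalar quantity, which is a function of the m + n variables in λ and x.
The first-order condition for a stationary value of φ is that all m + n first-order
partial derivatives should vanish, that is,

$$\begin{bmatrix} \dfrac{\partial \phi}{\partial x} \\[2mm] \dfrac{\partial \phi}{\partial \lambda} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f}{\partial x} - \dfrac{\partial}{\partial x}\left(\lambda' g(x)\right) \\[2mm] -g(x) \end{bmatrix} = 0 \tag{4-117}$$

Some care is required in the interpretation of

$$\frac{\partial}{\partial x}\left(\lambda' g(x)\right)$$

Since

$$\lambda' g(x) = \lambda_1 g_1(x_1, x_2, \ldots, x_n) + \lambda_2 g_2(x_1, x_2, \ldots, x_n) + \cdots + \lambda_m g_m(x_1, x_2, \ldots, x_n)$$

we have

$$\frac{\partial(\lambda' g(x))}{\partial x_i} = \lambda_1\frac{\partial g_1}{\partial x_i} + \lambda_2\frac{\partial g_2}{\partial x_i} + \cdots + \lambda_m\frac{\partial g_m}{\partial x_i} = \lambda'\frac{\partial g}{\partial x_i} \qquad i = 1, 2, \ldots, n$$

where

$$\frac{\partial g}{\partial x_i} = \begin{bmatrix} \partial g_1/\partial x_i \\ \partial g_2/\partial x_i \\ \vdots \\ \partial g_m/\partial x_i \end{bmatrix}$$

Since ∂f/∂x is an n × 1 vector, we need to arrange for (∂/∂x)(λ'g(x)) to have n
rows. Defining G as the n × m matrix of partial derivatives,

$$G = \begin{bmatrix} \dfrac{\partial g_1}{\partial x_1} & \dfrac{\partial g_2}{\partial x_1} & \cdots & \dfrac{\partial g_m}{\partial x_1} \\ \dfrac{\partial g_1}{\partial x_2} & \dfrac{\partial g_2}{\partial x_2} & \cdots & \dfrac{\partial g_m}{\partial x_2} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial g_1}{\partial x_n} & \dfrac{\partial g_2}{\partial x_n} & \cdots & \dfrac{\partial g_m}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial g_1}{\partial x} & \dfrac{\partial g_2}{\partial x} & \cdots & \dfrac{\partial g_m}{\partial x} \end{bmatrix}$$

we have

$$\frac{\partial \phi}{\partial x} = \frac{\partial f}{\partial x} - G\lambda$$

and the first-order conditions for a stationary point are

$$\begin{aligned} \frac{\partial f}{\partial x} - G\lambda &= 0 \\ g(x) &= 0 \end{aligned} \tag{4-118}$$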


The second equation in Eqs. (4-118) ensures that the stationary value satisfies
the constraints. To distinguish between maxima and minima, we must still
examine whether the quadratic form in Eq. (4-115) is negative definite or positive
definite, but now only for dx vectors which do not violate the constraints. Totally
differentiating the jth constraint

$$g_j(x_1, x_2, \ldots, x_n) = 0$$

gives

$$\frac{\partial g_j}{\partial x_1}\,dx_1 + \frac{\partial g_j}{\partial x_2}\,dx_2 + \cdots + \frac{\partial g_j}{\partial x_n}\,dx_n = 0$$

There is a similar condition for each constraint. Thus the dx vectors which do not
violate the constraints are given by

$$G'\,dx = 0 \tag{4-119}$$

In many cases the F matrix consists only of constants, and so its definiteness can
be established independently of any x values.
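A small worked case may help to fix ideas. The problem below is an illustration chosen for this purpose and does not appear in the text; it has a single linear constraint, so that G and F both consist only of constants and Eqs. (4-118) and (4-119) can be applied directly.

```latex
% Illustrative problem (assumed for this example): one linear constraint, n = 2, m = 1.
\begin{aligned}
&\text{minimize } y = f(x) = x_1^2 + x_2^2 \quad \text{subject to } g(x) = x_1 + x_2 - 1 = 0 \\
&\phi = x_1^2 + x_2^2 - \lambda (x_1 + x_2 - 1) \\
&\frac{\partial f}{\partial x} = \begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix}, \qquad
 G = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad
 F = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \\
&\frac{\partial f}{\partial x} - G\lambda = 0 \;\Rightarrow\; 2x_1 = \lambda,\; 2x_2 = \lambda; \qquad
 g(x) = 0 \;\Rightarrow\; x_1 + x_2 = 1 \\
&x_1^* = x_2^* = \tfrac12, \qquad \lambda^* = 1, \qquad y^* = \tfrac12 \\
&G'dx = 0 \;\Rightarrow\; dx_2 = -dx_1, \qquad
 dx' F\, dx = 2\,dx_1^2 + 2\,dx_2^2 = 4\,dx_1^2 > 0 \text{ for } dx \neq 0
\end{aligned}
```

The quadratic form is positive for every admissible nonzero dx, so the stationary point is a constrained minimum.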


PROBLEMS 


4-1 Expand (A + B)(A — B) and (A — B)(A + B). Are these expansions the same? If not, why not? 
How many terms are in each? 


4-2 Given 


a ee 
aeRO” <3 eae a 
cd ( 


Calculate (AB)’, B’A’, (AC)’, and C’A’. 

4-3 Find all matrices B obeying the equation 

0 Ijp_ {0 0 1 
0 2/8 0 0 2 
4-4 Find all matrices B which commute with 


adel 
[5 3] 
to give AB = BA, 


4-5 Write down a few matrices of order 3 × 3 with numerical elements. Find first their squares and
then their cubes, checking the latter by using the two processes A(A²) and A²(A).
4-6 Prove that diagonal matrices of the same order are commutative in multiplication with each other. 


4-7 Let 
Gangs oF 
J=);0 1 0 
Es 0.6: 


Write out in full some products JA, where A is a rectangular matrix. Describe in words the effect on A. 
Do the same with products of type AJ. Find J?. 


4-8 If 
010 
Y= 10 0/1 
00 0 
find V? and V3, Examine some products of the type VA, VA, and V’A. 


4-9 Given 
ee A aed sat 
A=/12 6 9 and E=/1 0 0 
the 6 001 
Calculate |A|, |E|, and |B|, where B = EA. Verify that |B| = |E||A|. 
4-10 Show that

$$\begin{vmatrix} 1 & 1 & 1 \\ a & b & c \\ a^2 & b^2 & c^2 \end{vmatrix} = (c - b)(c - a)(b - a)$$

4-11 If $(x_1, y_1)$ and $(x_2, y_2)$ are points on the x, y plane, show that the equation

$$\begin{vmatrix} x & y & 1 \\ x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \end{vmatrix} = 0$$

represents a straight line through the two points.

4-12 Prove that the determinant of a skew-symmetric matrix of odd order vanishes identically. (If A is
a skew-symmetric matrix, then A' = −A.)
4-13 Show that the matrix 


1 =2 1 
ve v5 ¥30 
Eee 
Ole ve v0 
cM Fy ot = 
v6 ¥30 
is orthogonal, that is, that $Q' = Q^{-1}$.
4-14 If the $u_i$ are normal variables with

$$E(u_i) = 0 \qquad E(u_i^2) = \sigma^2 \qquad i = 1, \ldots, n$$

$$E(u_i u_j) = 0 \qquad i \neq j$$

show that $E(u'Au) = \sigma^2\,\mathrm{tr}(A)$.


4-15 Given 
Lest 
[Wee 
SAG | 
itr) 
Compute 


$$A = I_n - x(x'x)^{-1}x'$$

Show that A is idempotent and determine its rank. Find the characteristic roots and the
associated characteristic vectors of A, and hence obtain the orthogonal matrix which diagonalizes A.

4-16 A and B are nonsingular matrices of the same order. Prove that AB and BA possess identical
characteristic roots. Show also that no such matrices can be found to satisfy the equation

$$AB - BA = I$$

(Cambridge Economics Tripos, 1967)

4-17 Evaluate the characteristic roots and vectors of

$$A = \begin{bmatrix} 5 & -6 & -6 \\ -1 & 4 & 2 \\ 3 & -6 & -4 \end{bmatrix}$$

4-18 Examine the following quadratic forms for positive definiteness:
(a) $6x_1^2 + 49x_2^2 + 51x_3^2 - 82x_2x_3 + 20x_1x_3 - 4x_1x_2$
(b) $4x_1^2 + 9x_2^2 + 2x_3^2 + 8x_2x_3 + 6x_3x_1 + 6x_1x_2$
(Cambridge Economics Tripos, 1968)
4-19 (a) Given that
A StS 
ADD: 22 
2) 2 BD) 


find $A^n$ for n > 1 and $B^n$ for n > 1.
(b) If A is defined as 


> 
i] 
BIN win wie 


show that A is orthogonal. 
Prove that the product of two orthogonal matrices of the same order is also an orthogonal 
matrix. 
(UL, 1967) 
4-20 X is a square matrix of order n and a is an n × 1 vector. Find

$$\frac{\partial(a'Xa)}{\partial X}$$

(a) When the elements of X are independent.
(b) When X is symmetric.
Note: If X is a matrix whose elements are variables $x_{ij}$ and f(X) is a scalar function, then

$$Z = \frac{\partial f(X)}{\partial X}$$

is a matrix of the same order as X such that

$$z_{ij} = \frac{\partial f(X)}{\partial x_{ij}}$$


CHAPTER 


FIVE 
THE k-VARIABLE LINEAR MODEL 


Chapter 2 contains a fairly complete treatment of the two-variable (k = 2) model. 
Some of the algebra of the three-variable (k = 3) model was developed in Sec. 4 
of Chap. 3. Section 1 of Chap. 4 indicated the power of matrix algebra to give a 
compact representation of the general k-variable model. It is now time to give a 
complete statistical treatment of the k-variable model. To facilitate this treatment 
we first provide a review of some basic statistical results in matrix form. These
results are extensions of the material on matrix algebra in Chap. 4 and the 
statistical material in various sections of App. A. 


5-1 PRELIMINARY STATISTICAL RESULTS 


Let x denote a vector of random variables $X_1, X_2, \ldots, X_n$. Each variable has an
expected value

$$\mu_i = E(X_i) \qquad i = 1, 2, \ldots, n$$

Collecting these expected values in a vector μ gives

$$E(x) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_n) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix} = \mu \tag{5-1}$$

The application of the operator E to the vector x means that E is applied to each
element of x. The variance of $X_i$, by definition, is

$$\operatorname{var}(X_i) = E\{(X_i - \mu_i)^2\}$$

and the covariance between $X_i$ and $X_j$ is

$$\operatorname{cov}(X_i, X_j) = E\{(X_i - \mu_i)(X_j - \mu_j)\}$$

If we define the vector x − μ and then form

$$E\{(x - \mu)(x - \mu)'\} = E\left\{ \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_n - \mu_n \end{bmatrix} \begin{bmatrix} (X_1 - \mu_1) & (X_2 - \mu_2) & \cdots & (X_n - \mu_n) \end{bmatrix} \right\}$$

$$= \begin{bmatrix} E(X_1 - \mu_1)^2 & E(X_1 - \mu_1)(X_2 - \mu_2) & \cdots & E(X_1 - \mu_1)(X_n - \mu_n) \\ E(X_2 - \mu_2)(X_1 - \mu_1) & E(X_2 - \mu_2)^2 & \cdots & E(X_2 - \mu_2)(X_n - \mu_n) \\ \vdots & \vdots & & \vdots \\ E(X_n - \mu_n)(X_1 - \mu_1) & E(X_n - \mu_n)(X_2 - \mu_2) & \cdots & E(X_n - \mu_n)^2 \end{bmatrix}$$

we see that the elements of this matrix are the variances and covariances of the X
variables, the variances being displayed on the main diagonal and the covariances
in the off-diagonal positions. The matrix is known as a variance-covariance matrix
or, more simply, as a variance matrix or a covariance matrix.† We will generally
refer to it as a variance matrix, denote it by var(x), and write

$$\operatorname{var}(x) = E\{(x - \mu)(x - \mu)'\} = \Sigma \tag{5-2}$$


The variance matrix Σ is clearly symmetric. It is important to determine whether
Σ is positive definite or not. Define a scalar variable Y as a linear combination of
the X's, that is,

$$Y = (x - \mu)'c \tag{5-3}$$

where c is any arbitrary n-element column vector. Squaring Eq. (5-3) and taking
expectations gives

$$\begin{aligned} E(Y^2) &= E\{c'(x - \mu)(x - \mu)'c\} \\ &= c'E\{(x - \mu)(x - \mu)'\}c \\ &= c'\Sigma c \end{aligned}$$

There are two useful points to notice about this development. First of all,
(x − μ)'c is a scalar, thus its square may be found by multiplying it by its
transpose. Second, whenever we have to take the expectation of a complicated
matrix expression, the E operator may be moved to the right past any vectors or
matrices consisting only of constants, but it must stop in front of any expression
involving random variables. Since Y is a scalar random variable, E(Y²) ≥ 0. Thus

$$c'\Sigma c \geq 0$$

and Σ is positive semidefinite. But

$$E(Y^2) = 0 \;\Rightarrow\; Y = 0$$

which, from Eq. (5-3), means that the X deviations $(X_1 - \mu_1), (X_2 - \mu_2), \ldots, (X_n - \mu_n)$
are linearly dependent. Thus

Σ is positive definite, provided no linear dependence exists among the X's.

† Alternative expressions for the variance-covariance matrix are cov(x) and V(x).
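The role of linear dependence can be seen in a small numerical sketch. The discrete distribution below is an assumption made purely for illustration: four equally likely outcomes for two variables, with a third variable defined as their exact sum. The helper names `variance_matrix` and `quadratic_form` are likewise hypothetical.

```python
from fractions import Fraction

def variance_matrix(points):
    """Sigma = E{(x - mu)(x - mu)'} when the listed points are equally likely."""
    n_obs, n_var = len(points), len(points[0])
    mu = [Fraction(sum(p[j] for p in points), n_obs) for j in range(n_var)]
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n_obs
             for j in range(n_var)] for i in range(n_var)]

def quadratic_form(c, sigma):
    """Evaluate c' Sigma c."""
    n = len(c)
    return sum(c[i] * sigma[i][j] * c[j] for i in range(n) for j in range(n))

# Assumed outcomes for (X1, X2); X3 = X1 + X2 is an exact linear combination.
base = [(0, 0), (1, 0), (0, 2), (3, 1)]
points = [(x1, x2, x1 + x2) for x1, x2 in base]
sigma = variance_matrix(points)

print(quadratic_form([1, 0, 0], sigma))    # var(X1)
print(quadratic_form([1, 1, -1], sigma))   # variance of X1 + X2 - X3
```

The choice c' = [1 1 −1] picks out the combination X1 + X2 − X3, which never varies, so c'Σc = 0 for a nonzero c and Σ is only positive semidefinite. Other choices of c give nonnegative values, as the argument above requires.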


The n random variables will have some multivariate probability density function
(pdf) written

$$p(x) = p(X_1, X_2, \ldots, X_n)$$

which is simply some formula or rule giving the likelihood of various combinations
of X values. The most important multivariate pdf is the multivariate normal.
The univariate normal distribution is specified once its mean μ and its variance σ²
are given. The multivariate normal is similarly specified in terms of its mean
vector μ and its variance matrix Σ. The formula is

$$p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left\{-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right\} \tag{5-4}$$

A compact shorthand statement of Eq. (5-4) is

$$x \sim N(\mu, \Sigma)$$

to be read, "the variables in x are distributed according to the multivariate
normal law with mean vector μ and variance matrix Σ." When n = 1, Σ = σ² and
Eq. (5-4) becomes

$$p(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(X - \mu)^2\right\}$$

which is the familiar univariate normal density. When n = 2, if we use ρ to denote
the correlation between $X_1$ and $X_2$, the variance matrix becomes

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \qquad \text{with} \qquad |\Sigma| = \sigma_1^2\sigma_2^2(1 - \rho^2)$$

Notice that |Σ| > 0 unless ρ² = 1, so that the variance matrix is positive definite
unless there is perfect linear correlation between the two variables, in agreement
with the general result above. Substitution in Eq. (5-4) gives

$$p(X_1, X_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{X_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{X_1 - \mu_1}{\sigma_1}\right)\left(\frac{X_2 - \mu_2}{\sigma_2}\right) + \left(\frac{X_2 - \mu_2}{\sigma_2}\right)^2\right]\right\}$$

An especially important case of Eq. (5-4) occurs when all the X's have the
same variance σ² and are all pairwise uncorrelated.‡ Then


$$\Sigma = \sigma^2 I$$

with

$$|\Sigma| = \sigma^{2n} \qquad \Sigma^{-1} = \frac{1}{\sigma^2} I$$

and

$$p(x) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)'(x - \mu)\right\} \tag{5-5}$$

Equation (5-5) thus factorizes into

$$p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} \left\{\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(X_i - \mu_i)^2\right]\right\} = p(X_1)\,p(X_2) \cdots p(X_n)$$

so that the multivariate density is the product of the separate marginal densities,
that is, the X's are distributed independently of one another. This is an extremely
important result. Zero correlations between normally distributed variables imply
statistical independence. This result does not necessarily hold for variables which
are not normally distributed. Notice carefully that these results depend on zero
correlations in the population and not on zero sample correlations.

A more general case of this result may be derived from Eq. (5-4). Suppose Σ
has the form

$$\Sigma = \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \tag{5-6}$$

where $\Sigma_{11}$ is square of order r and $\Sigma_{22}$ is square of order n − r. The form of Eq.
(5-6) means that each and every variable in the set $X_1, X_2, \ldots, X_r$ is uncorrelated
with each and every variable in the set $X_{r+1}, X_{r+2}, \ldots, X_n$. Applying a similar
partitioning to x and μ,

$$(x - \mu)'\Sigma^{-1}(x - \mu) = (x_1 - \mu_1)'\Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)'\Sigma_{22}^{-1}(x_2 - \mu_2)$$

using Eq. (4-75) for $\Sigma^{-1}$. Also from Eq. (4-79)

$$|\Sigma| = |\Sigma_{11}|\,|\Sigma_{22}|$$

Making these substitutions in Eq. (5-4) gives

$$p(x) = \left\{\frac{1}{(2\pi)^{r/2}|\Sigma_{11}|^{1/2}} \exp\left[-\tfrac{1}{2}(x_1 - \mu_1)'\Sigma_{11}^{-1}(x_1 - \mu_1)\right]\right\} \times \left\{\frac{1}{(2\pi)^{(n-r)/2}|\Sigma_{22}|^{1/2}} \exp\left[-\tfrac{1}{2}(x_2 - \mu_2)'\Sigma_{22}^{-1}(x_2 - \mu_2)\right]\right\}$$

that is,

$$p(x) = p(x_1)\,p(x_2)$$

so that the first r variables are distributed independently of the remaining n − r
variables.

‡ The assumption of a common variance is only made for simplicity. All that is required for the
result is that the Σ matrix be diagonal.


Distributions of Quadratic Forms 


Suppose

$$x \sim N(0, I)$$

that is, the n variables in x have independent normal distributions, each with zero
mean and unit variance. In other terminology, the X's are independent standardized
normal variables. The sum of squares x'x is a particularly simple example of
a quadratic form with matrix I. From the definition of the χ² variable,

$$x'x \sim \chi^2(n)$$

for χ²(n) is the sum of the squares of n independent standardized normal
variables.

Suppose now that

$$x \sim N(0, \sigma^2 I) \tag{5-7}$$

The variables are still independent and have zero means, but each X has to be
divided by σ to yield a variable with unit variance. Thus

$$\frac{X_1^2}{\sigma^2} + \frac{X_2^2}{\sigma^2} + \cdots + \frac{X_n^2}{\sigma^2} \sim \chi^2(n)$$

that is,

$$\frac{1}{\sigma^2}\,x'x \sim \chi^2(n) \tag{5-8}$$

or

$$x'(\sigma^2 I)^{-1}x \sim \chi^2(n) \tag{5-9}$$

Equation (5-9) shows explicitly that the matrix of the quadratic form is the
inverse of the variance matrix.

Suppose now that

$$x \sim N(0, \Sigma) \tag{5-10}$$

where Σ is a positive definite matrix. The equivalent expression to Eq. (5-9) would
now be

$$x'\Sigma^{-1}x \sim \chi^2(n) \tag{5-11}$$


This result does in fact hold, but the proof is no longer direct since the X
variables are no longer statistically independent. The trick is to transform the X's
into Y's, which will be independent standardized normal variables. Since Σ is
positive definite, by Eq. (4-111) there exists a nonsingular matrix P such that

$$\Sigma = PP'$$

which gives

$$\Sigma^{-1} = (P^{-1})'P^{-1} \qquad \text{and} \qquad P^{-1}\Sigma(P^{-1})' = I \tag{5-12}$$

Define an n-element y vector as

$$y = P^{-1}x$$

The Y variables are multivariate normal since they are linear combinations of the
X's,

$$E(y) = P^{-1}E(x) = P^{-1}0 = 0$$

and

$$\begin{aligned} \operatorname{var}(y) &= E\{P^{-1}xx'(P^{-1})'\} \\ &= P^{-1}\Sigma(P^{-1})' \\ &= I \end{aligned}$$

from Eq. (5-12). Thus the Y's are independent standardized normal variables and

$$y'y \sim \chi^2(n)$$

But

$$y'y = x'(P^{-1})'P^{-1}x = x'\Sigma^{-1}x$$

from Eq. (5-12). So

$$x'\Sigma^{-1}x \sim \chi^2(n)$$

which is the result anticipated in Eq. (5-11).
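The algebraic identity y'y = x'Σ⁻¹x behind this proof can be checked numerically. The sketch below is an illustration only: it takes P to be the lower-triangular Cholesky factor of Σ (one admissible choice of P, not necessarily the one built from eigenvectors), and the 2 × 2 matrix, the x vector, and the function names are assumptions.

```python
import math

def cholesky_2x2(s):
    """Lower-triangular P with s = P P' for a 2 x 2 positive definite s."""
    p11 = math.sqrt(s[0][0])
    p21 = s[1][0] / p11
    p22 = math.sqrt(s[1][1] - p21 ** 2)
    return [[p11, 0.0], [p21, p22]]

def forward_solve(p, x):
    """Solve P y = x for lower-triangular P, that is, y = P^{-1} x."""
    y1 = x[0] / p[0][0]
    y2 = (x[1] - p[1][0] * y1) / p[1][1]
    return [y1, y2]

def quad_form_inverse(s, x):
    """x' S^{-1} x using the explicit inverse of a 2 x 2 matrix."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    return sum(x[i] * inv[i][j] * x[j] for i in range(2) for j in range(2))

# Assumed variance matrix and an arbitrary x vector.
sigma = [[4.0, 1.2], [1.2, 1.0]]
x = [1.5, -0.7]

P = cholesky_2x2(sigma)
y = forward_solve(P, x)
lhs = sum(v * v for v in y)          # y'y
rhs = quad_form_inverse(sigma, x)    # x' Sigma^{-1} x
print(lhs, rhs)
```

The two printed numbers agree up to rounding error, whatever x is chosen. The distributional statement itself, of course, concerns random x and is not established by a single evaluation.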
Assume again

$$x \sim N(0, I)$$

and now consider the quadratic form x'Ax where A is idempotent with rank
r ≤ n. If we denote the matrix of eigenvectors of A by Q, then

$$Q'AQ = \Lambda = \begin{bmatrix} 1 & & & & & \\ & \ddots & & & & \\ & & 1 & & & \\ & & & 0 & & \\ & & & & \ddots & \\ & & & & & 0 \end{bmatrix} \tag{5-13}$$

where Λ will have r units and n − r zeros on the main diagonal. Define

$$y = Q'x$$

Thus

$$x = Qy$$

since Q is orthogonal. Then

$$E(y) = 0$$

and

$$\begin{aligned} \operatorname{var}(y) &= E\{yy'\} \\ &= E(Q'xx'Q) \\ &= Q'IQ \\ &= I \end{aligned}$$

since Q'Q = I. Thus the Y's are independent standardized normal variables. The
quadratic form may now be expressed as

$$\begin{aligned} x'Ax &= y'Q'AQy \\ &= Y_1^2 + Y_2^2 + \cdots + Y_r^2 \end{aligned}$$

using Eq. (5-13). Thus

$$x'Ax \sim \chi^2(r)$$

The general result is

If x ~ N(0, σ²I) and A is idempotent of rank r, $\dfrac{1}{\sigma^2}\,x'Ax \sim \chi^2(r)$
o 


Independence of Quadratic Forms 


Suppose x ~ N(0, σ²I) and we have two quadratic forms

$$x'Ax \qquad \text{and} \qquad x'Bx$$

where A and B are symmetric idempotent matrices of the same order. We seek the
condition for the two forms to be independently distributed. Because the matrices
are symmetric idempotent,

$$x'Ax = (Ax)'(Ax) \qquad \text{and} \qquad x'Bx = (Bx)'(Bx)$$

If each of the variables in the vector Ax has zero correlation with each variable in
Bx, they will be distributed independently of one another, and hence any function
of the one set of variables, such as x'Ax, will be distributed independently of any
function of the other set, such as x'Bx. The covariances between the variables in
Ax and those in Bx are given by

$$E\{(Ax)(Bx)'\} = E\{Axx'B\} = \sigma^2 AB$$

These covariances (and hence the correlations) are all zero if and only if

$$AB = 0 \tag{5-14}$$

Since A and B are symmetric, the condition may be equivalently stated as
BA = 0; the one implies the other. Thus two quadratic forms with idempotent
matrices will be distributed independently if the product of the idempotent matrices is
the null matrix.
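A familiar pair of matrices satisfying condition (5-14) may be used as a check. The sketch below is an illustration under stated assumptions: M = (1/n)ii' (with i a column of units) produces the squared sample mean, A = I − M produces the sum of squared deviations, and the data vector is made up. The helper `matmul` is hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

n = 4
M = [[Fraction(1, n)] * n for _ in range(n)]                                # (1/n) i i'
A = [[(1 if i == j else 0) - M[i][j] for j in range(n)] for i in range(n)]  # I - (1/n) i i'
zero = [[0] * n for _ in range(n)]

print(matmul(M, M) == M, matmul(A, A) == A)   # both matrices are idempotent
print(matmul(A, M) == zero)                   # their product is the null matrix

x = [3, 1, 4, 2]                              # assumed data
col = [[v] for v in x]
mean_part = matmul(matmul([x], M), col)[0][0]  # x'Mx = n * (sample mean)^2
dev_part = matmul(matmul([x], A), col)[0][0]   # x'Ax = sum of squared deviations
print(mean_part, dev_part)
```

Since AM = 0, the two quadratic forms x'Mx and x'Ax are independently distributed when x ~ N(0, σ²I). The code verifies only the matrix condition, not the distributional conclusion.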


Independence of a Quadratic Form and a Linear Function 


Assume x ~ N(0, σ²I). Let x'Ax be a quadratic form with A a symmetric
idempotent matrix of order n and let Lx be an m-element vector, each element
being a linear combination of the X's. Thus L is of order m × n, and we note that
it need not be square or symmetric. If the variables in Ax and Lx are to have zero
covariances, we require

$$E\{Axx'L'\} = \sigma^2 AL' = 0$$

or equivalently

$$LA = 0 \tag{5-15}$$


5-2 ASSUMPTIONS OF THE LINEAR MODEL 


The first basic assumption of the model is that the vector of sample observations
on Y may be expressed as a linear combination of the sample observations on the
explanatory X variables plus a disturbance vector, that is,

1. $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$ (5-16)

where each vector is a column vector of n elements. The $x_1$ vector is a column of
units to allow for an intercept term. Each of the remaining $x_i$ vectors (i =
2, 3, ..., k) denotes the sample observations on a specific explanatory variable.
The β's are unknown population (model) parameters, but even if we knew their
values, the linear combination $(\beta_1 x_1 + \cdots + \beta_k x_k)$ would not determine the y
vector exactly, for economic relations are stochastic, not exact. Thus u is a
disturbance vector measuring the discrepancies between the linear combination
and any actual sample realization of Y values.†

† An outline of the various reasons for the introduction of the disturbance term has already been
given in Sec. 1 of Chap. 2.

Equation (5-16) may be expressed in matrix form as

$$y = X\beta + u \tag{5-17}$$

where

$$y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \qquad X = \begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} \qquad u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$

The central problem is to obtain an estimate of the unknown β vector. To make
any progress with this we need to make some further assumptions about how the
observations on Y have been generated.

2. E(u) = 0, that is, E(y) = Xβ

To illustrate the meaning of this assumption, let us assume that the X
variables measure family income and various other family characteristics and Y
denotes family expenditure on, say, travel. The first row of the X matrix is some
specific set of numbers for family income, size, and composition. Let $s_1$ denote a
row vector consisting of these numbers. Then

$$E(Y_1) = s_1\beta$$

is the average, or expected, level of travel expenditure for this type of family.
However, if we observe the actual travel expenditure of a family with these
characteristics, it may be greater than the expected level, and the expenditure of
another family with the same characteristics may well be less than the expected
value. Or if we observe the travel expenditures of the same family in different
periods of time, these may be expected to fluctuate around the mean value.
However, if the theorist has done a good job in specifying all the significant
explanatory variables to be included in X, it is reasonable to assume that both
positive and negative discrepancies from the expected value will occur and that,
on balance, they will average out at zero, that is,

$$E(u_1) = 0$$

Similar considerations apply to each row of X, and so we have

$$E(u) = \begin{bmatrix} E(u_1) \\ E(u_2) \\ \vdots \\ E(u_n) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = 0$$


3. E(uu') = σ²I

Since E(u) = 0, E(uu') is a variance matrix. This assumption gives

$$\begin{bmatrix} \operatorname{var}(u_1) & \operatorname{cov}(u_1, u_2) & \cdots & \operatorname{cov}(u_1, u_n) \\ \operatorname{cov}(u_2, u_1) & \operatorname{var}(u_2) & \cdots & \operatorname{cov}(u_2, u_n) \\ \vdots & \vdots & & \vdots \\ \operatorname{cov}(u_n, u_1) & \operatorname{cov}(u_n, u_2) & \cdots & \operatorname{var}(u_n) \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$$

This is a double assumption, namely:

• Each u distribution has the same variance.
• All disturbances are pairwise uncorrelated.

The first property is referred to as homoscedasticity (or homogeneous variances)
and its opposite as heteroscedasticity. If the sample observations related to travel
expenditures of a cross section of households, the assumption of homoscedasticity
would probably not be a reasonable one, since low income families will almost
certainly have low average expenditures on travel and also a low variance of
actual travel expenditure about the average, while high income families will tend
to display both higher mean levels of expenditure and greater variance about the
mean. The second part of this assumption, that all disturbances are pairwise
uncorrelated, is a very strong assumption indeed. Again, in the context of the
travel example it means that the size and sign of the disturbance for any one
family has no influence on the size and sign of the disturbance for any other
family. This is not to deny the possibility of "keeping up with the Joneses" as an
important economic and sociological fact. If such a phenomenon does exist, it
would be more appropriately characterized in the specification of the X variables.
If the sample data related, say, to aggregate travel expenditure over a period of
years, the same assumption means, for example, that unusually heavy expenditure
in one year does not tend to be associated with unusually low (or high)
expenditures in the next year or indeed in any subsequent year.


4. ρ(X) = k

This assumption states that the explanatory variables do not form a linearly
dependent set. For example, if we had just two explanatory variables, $X_2$ and $X_3$,
and this assumption was not fulfilled, there would then exist an exact relationship

$$X_3 = c_1 + c_2 X_2 \tag{5-18}$$

which, combined with the hypothesized

$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u \tag{5-19}$$

gives

$$Y = (\beta_1 + \beta_3 c_1) + (\beta_2 + \beta_3 c_2)X_2 + u \tag{5-20}$$

The constants $c_1$ and $c_2$ can be determined exactly, and we can estimate the
intercept and slope of Eq. (5-20), but there is no way to obtain estimates of the
three β parameters.
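The consequence of violating this assumption shows up directly in the matrix X'X. The following sketch is illustrative only: the data, the constants c₁ = 2 and c₂ = 3, and the helper names are assumptions. It builds X'X for an X matrix obeying an exact relation of the form (5-18) and for one in which a single observation breaks the relation.

```python
from fractions import Fraction

def det(m):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    a = [[Fraction(v) for v in row] for row in m]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # no pivot available: the matrix is singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def cross_product_matrix(columns):
    """X'X for a data matrix supplied as a list of its columns."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]

ones = [1, 1, 1, 1, 1]
x2 = [1, 2, 3, 4, 5]
x3_exact = [2 + 3 * v for v in x2]   # X3 = c1 + c2*X2 with assumed c1 = 2, c2 = 3
x3_free = [6] + x3_exact[1:]         # first observation no longer obeys the relation

det_exact = det(cross_product_matrix([ones, x2, x3_exact]))
det_free = det(cross_product_matrix([ones, x2, x3_free]))
print(det_exact, det_free)
```

With the exact relation the determinant of X'X is zero, so X'X has no inverse and the three β's cannot be separately estimated. Once the relation fails for even one observation, X has full column rank and the determinant is positive.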


5. X is a nonstochastic matrix. 


This assumption at first sight seems incongruous. It means that if we take 
another sample of n observations, the X matrix of explanatory variables remains 
unchanged, the only source of variation then being in the u vector and hence in 
the y vector. However, the social sciences are notoriously difficult for being 
observational and nonexperimental so that in general the X variables are not 
subject to experimental control by the social scientist. There are three main points 
to be made about this assumption. First of all, in spite of the remarks above, there 
are cases where the X data can be controlled. In a cross-section survey, the sample 
design may call for the inclusion of certain numbers of families with specific 
characteristics, and sampling is continued until these specifications are met. 
Second, even if it is not in fact feasible to control the X data precisely, it is still 
useful to be able to make statistical inferences which are conditional on the X 
values actually present in the sample. In this light it is very much an assumption 
of convenience in that it simplifies dramatically the derivation of several basic 
statistical results. Third, once these simple results have been derived, it is possible
to weaken the assumption to allow the X variables to be stochastic, but distributed
independently of the disturbance term, and then see what modifications of the
earlier results are required.


6. The u vector has a multivariate normal distribution.


Assumptions 2, 3, and 6 may then be combined in the single statement:

$$u \sim N(0, \sigma^2 I) \tag{5-21}$$


5-3 ORDINARY LEAST-SQUARES (OLS) ESTIMATES 


The most frequently used estimating technique for the model outlined in Sec. 5-2
is least squares. The hypothesized model is

$$y = X\beta + u \tag{5-22}$$

Let $b_*$ denote any arbitrary k-element vector. This in turn serves to define a
vector of errors, or residuals,

$$e_* = y - Xb_* \tag{5-23}$$

The least-squares principle for choosing $b_*$ is to minimize the sum of the squared
residuals $e_*'e_*$. From Eq. (5-23)

$$\begin{aligned} e_*'e_* &= (y - Xb_*)'(y - Xb_*) \\ &= y'y - 2b_*'X'y + b_*'X'Xb_* \end{aligned}$$

Thus

$$\frac{\partial(e_*'e_*)}{\partial b_*} = -2X'y + 2X'Xb_* \tag{5-24}$$

The necessary condition for a stationary point requires that we set Eq. (5-24)
equal to the 0 vector. Denoting the resultant OLS solution for $b_*$ simply by b gives

$$(X'X)b = X'y \tag{5-25}$$

These are referred to as the OLS normal equations. Assumption 4 ensures that X'X
is nonsingular. Thus an equivalent expression for b is

$$b = (X'X)^{-1}X'y \tag{5-26}$$

The vector of OLS residuals is likewise denoted by e, where

$$e = y - Xb \tag{5-27}$$

Using this expression to substitute for y in Eq. (5-25) gives

$$(X'X)b = (X'X)b + X'e$$

Thus

$$X'e = \begin{bmatrix} x_1'e \\ x_2'e \\ \vdots \\ x_k'e \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{5-28}$$

This is a fundamental OLS result. The first element in this equation gives

$$\bar{e} = 0$$

that is, the residuals from the OLS regression always have zero mean, provided
that the equation contains a constant term. The remaining elements in Eq. (5-28)
state that the residual has zero sample correlation with each X variable.
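The normal equations and the orthogonality result X'e = 0 can be verified on a small data set. The sketch below is a minimal illustration, not a recommended computational method: the data are made up, the helper names are hypothetical, and the normal equations are solved by elementary elimination in exact rational arithmetic so that X'e comes out exactly zero.

```python
from fractions import Fraction

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination; a must be square and nonsingular."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Assumed data: a column of units plus two explanatory variables, n = 5, k = 3.
X = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 6]]
y = [3, 4, 8, 9, 13]

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [row[0] for row in matmul(Xt, [[v] for v in y])]
b = solve(XtX, Xty)                                    # normal equations (X'X) b = X'y
e = [yi - sum(xij * bj for xij, bj in zip(xi, b)) for xi, yi in zip(X, y)]
Xte = [sum(col[i] * e[i] for i in range(len(e))) for col in Xt]

print("b   =", [str(v) for v in b])
print("X'e =", [str(v) for v in Xte])
print("mean residual =", sum(e) / len(e))
```

Every element of X'e is zero, and because the first column of X is the column of units, the first element of X'e is the sum of the residuals, which is therefore zero as well.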

To establish that the stationary point does indeed correspond to a minimum
of the sum of squares, differentiate Eq. (5-24) once again with respect to $b_*$ to
obtain

$$\frac{\partial^2(e_*'e_*)}{\partial b_*^2} = 2(X'X) \tag{5-29}$$

From Sec. 4-7 this gives a minimum provided X'X is positive definite. To establish
this, let d be any nonnull k-element vector, and consequently define an n-element
vector c as

$$c = Xd \tag{5-30}$$

The assumption that X has full column rank ensures that c is nonnull; otherwise
Eq. (5-30) would express a linear dependence between the columns of X. Thus

$$c'c = d'X'Xd > 0$$

and X'X is positive definite.
Returning to Eq. (5-26),

$$b = (X'X)^{-1}X'y$$

and substituting

$$y = X\beta + u$$

gives

$$b = \beta + (X'X)^{-1}X'u \tag{5-31}$$

Taking expectations gives

$$E(b) = \beta \tag{5-32}$$

since

$$E\{(X'X)^{-1}X'u\} = (X'X)^{-1}X'E(u) = 0$$

by assumptions 2 and 5. The OLS estimator is thus a linear unbiased estimator.
The linearity property refers to linearity in y (or u) as is seen in Eq. (5-26) or Eq.
(5-31), for each element in b is a linear combination of the elements of y (or u),
the weights being functions of the X data which are nonstochastic. The unbiasedness
is established in Eq. (5-32).

Next we derive the variance-covariance matrix of the OLS estimators. From
Eqs. (5-31) and (5-32),

$$b - E(b) = b - \beta = (X'X)^{-1}X'u$$

Thus
$$\begin{aligned} \operatorname{var}(b) &= E\{(X'X)^{-1}X'uu'X(X'X)^{-1}\} \\ &= (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} \qquad \text{from assumptions 3 and 4} \\ &= \sigma^2(X'X)^{-1} \end{aligned} \tag{5-33}$$

since I may be suppressed at will and the scalar σ² moved in front or behind
matrices. The elements on the main diagonal of Eq. (5-33) give the sampling
variances of the corresponding elements of b, and the off-diagonal terms give the
sampling covariances. The most important result in least-squares theory is that no
other linear unbiased estimator can have smaller sampling variances than those of
the OLS estimator in Eq. (5-33). OLS estimators are thus said to be best linear
unbiased estimators (b.l.u.e.), that is, to have minimum variance within the class of
linear unbiased estimators. This result is known as the Gauss-Markov theorem.
The following proof is somewhat roundabout, but it has the advantage of
establishing a further important result at the same time. Let c denote an arbitrary
k-element column vector of known constants and define a scalar quantity μ as

$$\mu = c'\beta \tag{5-34}$$

If we choose c' = [0 1 0 ··· 0], then μ = β₂. Thus we can use Eq. (5-34) to
pick out any single element in β. Or if we choose

$$c' = [1 \quad X_{2,n+1} \quad X_{3,n+1} \quad \cdots \quad X_{k,n+1}]$$

then

$$\mu = E(Y_{n+1})$$

which is the expected value of the dependent variable Y in period n + 1,
conditional on the X values in that period.

We wish to consider the class of linear unbiased estimators of μ. Thus define
a scalar m which will serve as a linear estimator of μ, such that

$$m = a'y = a'X\beta + a'u \tag{5-35}$$

where a is some n-element column vector. The definition ensures linearity. To
ensure unbiasedness we have

$$\begin{aligned} E(m) &= a'X\beta + a'E(u) \\ &= a'X\beta \\ &= c'\beta \end{aligned}$$

only if

$$a'X = c' \tag{5-36}$$

From Eqs. (5-35) and (5-36),

$$\begin{aligned} \operatorname{var}(m) &= E\{a'uu'a\} \\ &= \sigma^2 a'a \end{aligned}$$
which derivation uses the fact that since a‘u is a scalar, its square can be written 
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as the product of its transpose and itself. The problem is thus to choose a to 
minimize a’a subject to the k side conditions a’X = ¢’. Define 
$ = a‘a — 2X (X‘a — c) (5-37) 


Here X is a column vector of k Lagrange multipliers, and the side conditions 
(5-36) have been transposed to make the multiplications in Eq. (5-37) conform- 
able. Differentiating 


“ = 2a-2XA=0 (5-38) 
and “ = 2(X’a—c) =0 (5-39) 
Premultiplying Eq. (5-38) by X’ and using Eq. (5-39) gives 
c= X’a= XXX 
Thus 
X= (XX) 'e 


Substituting back in Eq. (5-38) gives 
a=XX=X(X’X) 'c 
and so the desired minimum variance linear unbiased estimator of ¢’B is 
m=aly 
= ¢(X’x) 'X’y 
=cb (5-40) 
that is, the unknown f is replaced by the OLS b. It follows directly that + 


1, Each OLS coefficient 5; is a best linear unbiased estimator of the correspond- 
ing population coefficient B;. 

2. The b.l.u.e. of any linear combination of f’s is that same linear combination 
of the b’s. 

3. The b.L.u.e. of E(Y,) is 


b, + bX, , + by Xy t+ + OLX, 
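
The minimum-variance property can be checked numerically. The following is a minimal sketch, assuming NumPy and an arbitrary simulated design matrix (the data, the seed, and all variable names are illustrative and not from the text). It builds the weights a = X(X'X)^{-1}c, constructs a competing weight vector that also satisfies a'X = c', and compares a'a for the two. The projection matrix M = I - X(X'X)^{-1}X' is used here only as a convenient device for generating such competitors, since MX = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
# Hypothetical design matrix: a column of units plus two arbitrary regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
c = np.array([0.0, 1.0, 0.0])  # c'beta picks out the second coefficient

XtX_inv = np.linalg.inv(X.T @ X)
a_star = X @ XtX_inv @ c  # the minimizing weights a = X (X'X)^{-1} c

# A competing unbiased weight vector: a_star + M d keeps a'X = c' because M X = 0.
M = np.eye(n) - X @ XtX_inv @ X.T
d = rng.normal(size=n)
a_other = a_star + M @ d

unbiased_star = np.allclose(a_star @ X, c)
unbiased_other = np.allclose(a_other @ X, c)
var_star = a_star @ a_star     # var(m) / sigma^2 for the OLS-based weights
var_other = a_other @ a_other  # var(m) / sigma^2 for the competitor
```

Both weight vectors satisfy the unbiasedness condition (5-36), but the competitor's a'a exceeds that of the OLS-based weights, whose value equals c'(X'X)^{-1}c.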


† As far as the disturbance u is concerned, the derivation of the Gauss-Markov result has only 
required the assumptions of zero mean and zero covariances and has not required the assumption of 
normality. 


The Model in Deviation Form 


In Chaps. 2 and 3 it was seen that the two- and three-variable regression models 
could also be treated in deviation form. The essence of the approach was to first 
of all express all data in terms of deviations from sample means and then estimate 
the regression parameters in two stages, the first stage dealing with the slope 
coefficients and the second stage with the intercept term. The same treatment may 
be applied to the k-variable model by use of the following transformation matrix: 

$$\mathbf{A} = \mathbf{I} - \frac{1}{n}\mathbf{i}\mathbf{i}' \tag{5-41}$$


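where $\mathbf{i}$ denotes a column vector of n units. Thus 

$$\mathbf{A} = \mathbf{I} - \frac{1}{n}\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}$$

and if $\mathbf{y}' = [Y_1 \;\; Y_2 \;\; \cdots \;\; Y_n]$, then 

$$\mathbf{A}\mathbf{y} = \begin{bmatrix} Y_1 - \bar{Y} \\ Y_2 - \bar{Y} \\ \vdots \\ Y_n - \bar{Y} \end{bmatrix}$$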


Thus premultiplying any column vector of observations by A produces a vector 
showing those observations in deviation form. Two special cases are 


Ai=0 (5-42) 


or, more generally, premultiplying any vector of identical elements by A gives the 


zero vector. Second, 
Ae=e (5-43) 


for the residuals have zero mean, and are thus already in deviation form. It is 
easily verified that the A matrix is symmetric idempotent. 
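
The properties of A claimed above are easy to confirm numerically. The sketch below assumes NumPy; the five observations in `y` are arbitrary illustrative values, and the variable names are not from the text.

```python
import numpy as np

n = 5
i = np.ones((n, 1))
A = np.eye(n) - (i @ i.T) / n  # Eq. (5-41)

y = np.array([3.0, 1.0, 8.0, 3.0, 5.0])  # arbitrary illustrative observations
deviations = A @ y                        # observations minus their sample mean

is_symmetric = np.allclose(A, A.T)
is_idempotent = np.allclose(A @ A, A)
kills_constants = np.allclose(A @ i, 0.0)  # Eq. (5-42)
```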
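The OLS estimator $\mathbf{b}$ and residual vector $\mathbf{e}$ are connected by 

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \tag{5-44}$$

If we partition the $\mathbf{X}$ matrix as 

$$\mathbf{X} = [\mathbf{x}_1 \quad \mathbf{X}_2]$$

where $\mathbf{x}_1 (= \mathbf{i})$ is the usual column of units and $\mathbf{X}_2$ the n × (k − 1) matrix of 
observations on the variables $X_2, X_3, \ldots, X_k$, we can rewrite Eq. (5-44) as 

$$\mathbf{y} = \mathbf{x}_1b_1 + \mathbf{X}_2\mathbf{b}_2 + \mathbf{e} \tag{5-45}$$

where $\mathbf{b}' = [b_1 \quad \mathbf{b}_2']$ indicates a conformable partitioning of the $\mathbf{b}$ vector into the 
intercept $b_1$ and the subvector $\mathbf{b}_2$ of slope coefficients. Premultiplying Eq. (5-45) 
by $\mathbf{A}$ gives 

$$\mathbf{A}\mathbf{y} = \mathbf{A}\mathbf{X}_2\mathbf{b}_2 + \mathbf{e}$$

using Eqs. (5-42) and (5-43). Premultiplying this by $\mathbf{X}_2'$ yields 

$$\mathbf{X}_2'\mathbf{A}\mathbf{y} = \mathbf{X}_2'\mathbf{A}\mathbf{X}_2\mathbf{b}_2 \tag{5-46}$$

for $\mathbf{X}'\mathbf{e} = \mathbf{0}$ from Eq. (5-28). Finally, using the symmetric idempotency of $\mathbf{A}$ 
means that Eq. (5-46) is equivalent to 

$$(\mathbf{A}\mathbf{X}_2)'(\mathbf{A}\mathbf{y}) = (\mathbf{A}\mathbf{X}_2)'(\mathbf{A}\mathbf{X}_2)\mathbf{b}_2 \tag{5-47}$$

The interpretation of Eq. (5-47) is as follows: 


• $\mathbf{b}_2$ is the subvector of OLS slope coefficients. 

• $\mathbf{A}\mathbf{y}$ is the y vector expressed in deviation form. 

• $\mathbf{A}\mathbf{X}_2$ is the matrix of explanatory variables in deviation form. 

• Equation (5-47) is a set of normal equations [compare with Eq. (5-25)] in terms 
of deviations, whose solution yields the OLS slope coefficients. 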

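The remaining coefficient, which is the intercept term, is obtained by premultiplying 
$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$ by $\mathbf{i}'/n$ to yield 

$$\bar{Y} = [1 \quad \bar{X}_2 \quad \bar{X}_3 \quad \cdots \quad \bar{X}_k]\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix}$$

or 

$$b_1 = \bar{Y} - b_2\bar{X}_2 - b_3\bar{X}_3 - \cdots - b_k\bar{X}_k \tag{5-48}$$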
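The sum of squared deviations in the dependent variable, denoted by TSS, is 

$$\text{TSS} = \mathbf{y}'\mathbf{A}\mathbf{y}$$

This may be decomposed into an explained sum of squares (ESS) and a residual 
sum of squares (RSS) in the manner of Chaps. 2 and 3. Return to 

$$\mathbf{A}\mathbf{y} = \mathbf{A}\mathbf{X}_2\mathbf{b}_2 + \mathbf{e}$$

Transposing and multiplying, 

$$\underbrace{\mathbf{y}'\mathbf{A}\mathbf{y}}_{\text{(TSS)}} = \underbrace{\mathbf{b}_2'\mathbf{X}_2'\mathbf{A}\mathbf{X}_2\mathbf{b}_2}_{\text{(ESS)}} + \underbrace{\mathbf{e}'\mathbf{e}}_{\text{(RSS)}} \tag{5-49}$$

since the cross-product term vanishes in view of $\mathbf{X}'\mathbf{e} = \mathbf{0}$. The multiple correlation 
coefficient $R_{1.23\cdots k}$ for the k-variable case may then be defined in a number of 
alternative ways. The basic definition is 

$$R^2_{1.23\cdots k} = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\mathbf{e}'\mathbf{e}}{\mathbf{y}'\mathbf{A}\mathbf{y}} \tag{5-50}$$

In view of Eq. (5-49) this is equivalent to 

$$R^2 = \frac{\mathbf{b}_2'\mathbf{X}_2'\mathbf{A}\mathbf{X}_2\mathbf{b}_2}{\mathbf{y}'\mathbf{A}\mathbf{y}} = \frac{\mathbf{b}_2'\mathbf{X}_2'\mathbf{A}\mathbf{y}}{\mathbf{y}'\mathbf{A}\mathbf{y}} \tag{5-51}$$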


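where the second expression follows from Eq. (5-46). Alternatively, we may start 
with the complete OLS regression 

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$

Transposing, multiplying, and again using $\mathbf{X}'\mathbf{e} = \mathbf{0}$ gives 

$$\mathbf{y}'\mathbf{y} = \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b} + \mathbf{e}'\mathbf{e} \tag{5-52}$$

Using $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, equivalent expressions for Eq. (5-52) are 

$$\mathbf{y}'\mathbf{y} = \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{e}'\mathbf{e} = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} + \mathbf{e}'\mathbf{e} \tag{5-53}$$

Comparing Eq. (5-52) with Eq. (5-49), the residual sum of squares $\mathbf{e}'\mathbf{e}$ is the same 
in each equation, since the OLS regression is unique and it makes no difference 
whether we fit the complete regression directly to the original data or transform 
the data into deviation form and compute the slope coefficients followed by the 
intercept. The left-hand sides differ only in that 

$$\mathbf{y}'\mathbf{y} = \sum Y^2$$

and 

$$\mathbf{y}'\mathbf{A}\mathbf{y} = \sum (Y - \bar{Y})^2 = \sum Y^2 - n\bar{Y}^2 = \mathbf{y}'\mathbf{y} - n\bar{Y}^2$$

Subtracting the correction for the mean $n\bar{Y}^2$ from both sides of Eq. (5-52) gives 

$$\underbrace{(\mathbf{y}'\mathbf{y} - n\bar{Y}^2)}_{\text{(TSS)}} = \underbrace{(\mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b} - n\bar{Y}^2)}_{\text{(ESS)}} + \underbrace{\mathbf{e}'\mathbf{e}}_{\text{(RSS)}}$$

Thus the previous expressions for $R^2$ in terms of sums of squares may all be 
computed in terms of the original data, provided only that the correction for the 
mean is subtracted from any total or explained sum of squares (but not from the 
residual sum of squares). 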

It is sometimes useful to compute an $R^2$, adjusted for degrees of freedom, 
especially when comparing the explanatory power of different numbers of 
explanatory variables. Adding any extra explanatory variable can never increase the 
residual sum of squares and thus can never decrease the $R^2$ defined in Eq. (5-50), 
since that expression takes no account of the number of explanatory variables 
employed. It may be rewritten as 

$$R^2_{1.23\cdots k} = 1 - \frac{\mathbf{e}'\mathbf{e}/n}{\mathbf{y}'\mathbf{A}\mathbf{y}/n}$$

The adjusted $R^2$ is defined as 

$$\bar{R}^2_{1.23\cdots k} = 1 - \frac{\mathbf{e}'\mathbf{e}/(n-k)}{\mathbf{y}'\mathbf{A}\mathbf{y}/(n-1)}$$

The rationale behind the adjustment is that k parameters have been used in fitting 
the regression plane from which the residual sum of squares is measured, and one 
parameter, the sample mean, has been estimated in computing TSS. As will be 
seen later, these provide unbiased estimators of the disturbance variance and the 
Y variance. Equivalent expressions for the adjusted coefficient are 

$$\bar{R}^2_{1.23\cdots k} = 1 - \frac{n-1}{n-k}\left(1 - R^2_{1.23\cdots k}\right) = \frac{1-k}{n-k} + \frac{n-1}{n-k}R^2_{1.23\cdots k}$$

It is thus possible for the adjusted coefficient to decline if an additional variable 
produces too small a reduction in $1 - R^2$ to compensate for the increase in 
$(n-1)/(n-k)$. 
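
A small sketch of the adjustment follows. The function names and the numerical inputs (R² = 0.900 with k = 4 versus R² = 0.902 with k = 5, both with n = 20) are illustrative assumptions, not values from the text. It shows the two equivalent expressions agreeing, and a case in which a tiny gain in R² is outweighed by the loss of a degree of freedom.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 as 1 - (n - 1)/(n - k) * (1 - R^2)."""
    return 1.0 - (n - 1) / (n - k) * (1.0 - r2)


def adjusted_r2_alt(r2: float, n: int, k: int) -> float:
    """The equivalent form (1 - k)/(n - k) + (n - 1)/(n - k) * R^2."""
    return (1 - k) / (n - k) + (n - 1) / (n - k) * r2


# Illustrative values only: one extra regressor raises R^2 from 0.900 to 0.902.
before = adjusted_r2(0.900, n=20, k=4)
after = adjusted_r2(0.902, n=20, k=5)
adjusted_declines = after < before
```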


Example 5-1 To help fix some of these concepts, here is a brief numerical 
example. The numbers have been kept artificially simple so as not to obscure 
the nature of the operations with cumbersome arithmetic. The sample data 
are 


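$$\mathbf{y} = \begin{bmatrix} 3 \\ 1 \\ 8 \\ 3 \\ 5 \end{bmatrix} \quad\text{and}\quad \mathbf{X} = \begin{bmatrix} 1 & 3 & 5 \\ 1 & 1 & 4 \\ 1 & 5 & 6 \\ 1 & 2 & 4 \\ 1 & 4 & 6 \end{bmatrix}$$

where we have already inserted a column of units in the first column of $\mathbf{X}$. 
From these data we readily compute 

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} 5 & 15 & 25 \\ 15 & 55 & 81 \\ 25 & 81 & 129 \end{bmatrix} \quad\text{and}\quad \mathbf{X}'\mathbf{y} = \begin{bmatrix} 20 \\ 76 \\ 109 \end{bmatrix}$$

The normal equations of Eq. (5-25) are then 

$$\begin{bmatrix} 5 & 15 & 25 \\ 15 & 55 & 81 \\ 25 & 81 & 129 \end{bmatrix}\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} 20 \\ 76 \\ 109 \end{bmatrix}$$

Rather than invert $(\mathbf{X}'\mathbf{X})$ we will solve these equations by the elimination 
method. In the first step subtract three times the first row from the second 
and five times the first row from the third. This gives the revised system 

$$\begin{bmatrix} 5 & 15 & 25 \\ 0 & 10 & 6 \\ 0 & 6 & 4 \end{bmatrix}\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} 20 \\ 16 \\ 9 \end{bmatrix}$$

Next subtract six-tenths of row 2 from row 3 to get 

$$\begin{bmatrix} 5 & 15 & 25 \\ 0 & 10 & 6 \\ 0 & 0 & 0.4 \end{bmatrix}\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} 20 \\ 16 \\ -0.6 \end{bmatrix}$$

The third equation gives $0.4b_3 = -0.6$, that is, 

$$b_3 = -1.5$$

Substituting for $b_3$ in the second equation, 

$$10b_2 + 6b_3 = 16$$

gives 

$$b_2 = 2.5$$

Finally, the first equation 

$$5b_1 + 15b_2 + 25b_3 = 20$$

gives 

$$b_1 = 4$$

The regression equation is thus 

$$\hat{Y} = 4 + 2.5X_2 - 1.5X_3$$

Alternatively, transforming the data into deviation form gives 

$$\mathbf{A}\mathbf{y} = \begin{bmatrix} -1 \\ -3 \\ 4 \\ -1 \\ 1 \end{bmatrix} \quad\text{and}\quad \mathbf{A}\mathbf{X}_2 = \begin{bmatrix} 0 & 0 \\ -2 & -1 \\ 2 & 1 \\ -1 & -1 \\ 1 & 1 \end{bmatrix}$$

The normal Eqs. (5-46) now become 

$$\begin{bmatrix} 10 & 6 \\ 6 & 4 \end{bmatrix}\begin{bmatrix} b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} 16 \\ 9 \end{bmatrix}$$

The observant reader will notice that these are the second and third equations 
obtained in the first step of the elimination method above.† 

Thus the solutions for $b_2$ and $b_3$ will coincide with those already obtained. 
Likewise, $b_1$ will be the same as before, for the final equation in the 
back substitution above is readily seen to be 

$$b_1 = \bar{Y} - b_2\bar{X}_2 - b_3\bar{X}_3$$

Thus the elimination process applied to $(\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{y}$ is, in fact, equivalent to 
transforming the data into deviation form and proceeding in two-step fashion. 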
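† The reason why may be seen as follows: 

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} n & \sum X_2 & \sum X_3 \\ \sum X_2 & \sum X_2^2 & \sum X_2X_3 \\ \sum X_3 & \sum X_2X_3 & \sum X_3^2 \end{bmatrix}$$

To produce a zero in the (2, 1) position, we must subtract $\sum X_2/n$ $(= \bar{X}_2)$ times the first row from the 
second. In the (2, 2) position this gives $\sum X_2^2 - n\bar{X}_2^2 = \sum (X_2 - \bar{X}_2)^2$, and in the (2, 3) position it gives 
$\sum X_2X_3 - n\bar{X}_2\bar{X}_3 = \sum (X_2 - \bar{X}_2)(X_3 - \bar{X}_3)$. A similar argument applies to the transformed third row. 

To calculate $R^2$ we note from the $\mathbf{A}\mathbf{y}$ vector that 

$$\text{TSS} = \mathbf{y}'\mathbf{A}\mathbf{y} = 28$$

Calculating the explained sum of squares from $\mathbf{b}_2'\mathbf{X}_2'\mathbf{A}\mathbf{y}$ gives 

$$\text{ESS} = [2.5 \quad -1.5]\begin{bmatrix} 16 \\ 9 \end{bmatrix} = 26.5$$

Thus 

$$\text{RSS} = 28 - 26.5 = 1.5$$

and 

$$R^2 = \frac{26.5}{28} = 0.9464$$

so that the regression has accounted for almost 95 percent of the variance of 
Y. As a check we may calculate the explained sum of squares from $\mathbf{b}'\mathbf{X}'\mathbf{y}$ by 
subtracting the correction for the mean, 

$$\mathbf{b}'\mathbf{X}'\mathbf{y} = [4 \quad 2.5 \quad -1.5]\begin{bmatrix} 20 \\ 76 \\ 109 \end{bmatrix} = 106.5$$

$$n\bar{Y}^2 = 5(4)^2 = 80$$

Thus 

$$\text{ESS} = \mathbf{b}'\mathbf{X}'\mathbf{y} - n\bar{Y}^2 = 26.5$$

in agreement with the previous calculation. 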
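
The arithmetic of Example 5-1 can be reproduced in a few lines. This is a sketch assuming NumPy; the data are the five observations implied by the deviations and sample means quoted in the example (mean of Y equal to 4, means of X2 and X3 equal to 3 and 5), and the variable names are illustrative. It solves the normal equations directly, then repeats the calculation in two stages from the deviation form.

```python
import numpy as np

y = np.array([3.0, 1.0, 8.0, 3.0, 5.0])
X = np.array([[1, 3, 5],
              [1, 1, 4],
              [1, 5, 6],
              [1, 2, 4],
              [1, 4, 6]], dtype=float)
n = len(y)

# Route 1: solve the normal equations (X'X) b = X'y directly.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: slopes from the deviation-form normal equations, then the intercept.
A = np.eye(n) - np.ones((n, n)) / n
X2 = X[:, 1:]
slopes = np.linalg.solve(X2.T @ A @ X2, X2.T @ A @ y)
intercept = y.mean() - X2.mean(axis=0) @ slopes

tss = y @ A @ y
ess = slopes @ (X2.T @ A @ y)
rss = tss - ess
r2 = ess / tss
```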


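Estimation of $\sigma^2$ 


Finally in this section we derive an estimator of $\sigma^2$, the variance of the 
disturbance term. As the values of $\mathbf{u}$ are not directly observable, it seems plausible to 
base an estimate of $\sigma^2$ on the residual sum of squares $\mathbf{e}'\mathbf{e}$. The only question is 
what should the divisor be, and this can be settled by requiring the estimator to 
be unbiased. We have 

$$\begin{aligned} \mathbf{e} &= \mathbf{y} - \mathbf{X}\mathbf{b} \\ &= \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\ &= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y} \\ &= \mathbf{M}\mathbf{y} \end{aligned} \tag{5-54}$$

where 

$$\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \tag{5-55}$$

$\mathbf{M}$ is a very important matrix. It is easily verified that it is symmetric idempotent. 
It also follows directly by multiplying out that 

$$\mathbf{M}\mathbf{X} = \mathbf{0} \tag{5-56}$$

Returning to Eq. (5-54), 

$$\mathbf{e} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \mathbf{u})$$

that is, 

$$\mathbf{e} = \mathbf{M}\mathbf{u}$$

in view of Eq. (5-56). From the symmetric idempotency of $\mathbf{M}$, 

$$\mathbf{e}'\mathbf{e} = \mathbf{u}'\mathbf{M}\mathbf{u}$$

Taking expectations, 

$$\begin{aligned} E(\mathbf{e}'\mathbf{e}) &= E(\mathbf{u}'\mathbf{M}\mathbf{u}) \\ &= E\{\operatorname{tr}(\mathbf{u}'\mathbf{M}\mathbf{u})\} && \text{since } \mathbf{u}'\mathbf{M}\mathbf{u} \text{ is a scalar} \\ &= E\{\operatorname{tr}(\mathbf{M}\mathbf{u}\mathbf{u}')\} && \text{from Eq. (4-15)} \\ &= \sigma^2\operatorname{tr}(\mathbf{M}) && \text{by assumption 3} \end{aligned}$$

From Eq. (5-55), 

$$\begin{aligned} \operatorname{tr}(\mathbf{M}) &= \operatorname{tr}(\mathbf{I}) - \operatorname{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] \\ &= \operatorname{tr}(\mathbf{I}) - \operatorname{tr}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}] \\ &= n - k \end{aligned}$$

Thus if we define 

$$s^2 = \frac{\mathbf{e}'\mathbf{e}}{n-k} \tag{5-57}$$

it follows that 

$$E(s^2) = \sigma^2$$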

and we have found the desired unbiased estimator. The square root s is often 
referred to as the standard error of the estimate, and may be regarded as the 
standard deviation of the Y values about the regression plane. 
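
As a numerical illustration, the sketch below (assuming NumPy, and using the data implied by Example 5-1; variable names are illustrative) forms the M matrix of Eq. (5-55), confirms that MX = 0 and that tr(M) = n − k, and computes s² from Eq. (5-57).

```python
import numpy as np

y = np.array([3.0, 1.0, 8.0, 3.0, 5.0])
X = np.array([[1, 3, 5],
              [1, 1, 4],
              [1, 5, 6],
              [1, 2, 4],
              [1, 4, 6]], dtype=float)
n, k = X.shape

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T  # Eq. (5-55)
e = M @ y                                          # residuals, Eq. (5-54)

annihilates_X = np.allclose(M @ X, 0.0)            # Eq. (5-56)
is_idempotent = np.allclose(M @ M, M)
trace_M = np.trace(M)                              # should equal n - k
s2 = (e @ e) / (n - k)                             # Eq. (5-57)
```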


5-4 INFERENCE IN THE OLS MODEL 
So far we have not used the assumption that the u's are multivariate normal, but 
this now becomes necessary. We now make the twin assumptions 

$$\mathbf{u} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$$

and $\mathbf{X}$ is nonstochastic with rank k 

The first is a combination of assumptions 2, 3, and 6, and the second is a 
combination of assumptions 4 and 5. We have seen in Eq. (5-31) that 

$$\mathbf{b} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$$

so $\mathbf{b}$ is then multivariate normal, and since we have already established the mean 
vector and the covariance matrix, we have the fundamental result 

$$\mathbf{b} \sim N(\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}) \tag{5-58}$$

From the end of the previous section we also have 

$$\mathbf{e}'\mathbf{e} = \mathbf{u}'\mathbf{M}\mathbf{u}$$

From the result in Sec. 5-1 on the distribution of quadratic forms with idempotent 
matrices it then follows that 

$$\frac{\mathbf{e}'\mathbf{e}}{\sigma^2} \sim \chi^2(n-k) \tag{5-59}$$

The degrees of freedom in Eq. (5-59) come from the fact that 

$$\rho(\mathbf{M}) = \operatorname{tr}(\mathbf{M})$$

since $\mathbf{M}$ is idempotent, and we have just shown the trace to be n − k. Finally, 
applying condition (5-15) for the independence of a linear and quadratic form to 
$\mathbf{b}$ and $\mathbf{e}'\mathbf{e}$ gives 

$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{M} = \mathbf{0} \tag{5-60}$$

since $\mathbf{M}\mathbf{X} = \mathbf{0}$. Thus 

$\mathbf{b}$ is distributed independently of $s^2$. 

These results suffice to establish inference procedures for any element of $\mathbf{b}$. 
Consider, for example, $b_i$, the estimated coefficient of $X_i$ in the OLS regression. It 
follows from Eq. (5-58) that 

$$b_i \sim N(\beta_i, \sigma^2a_{ii})$$

where $a_{ii}$ denotes the ith element on the principal diagonal of $(\mathbf{X}'\mathbf{X})^{-1}$. Thus 

$$\frac{b_i - \beta_i}{\sigma\sqrt{a_{ii}}} \sim N(0, 1)$$

From Eqs. (5-59) and (5-60), 

$$\frac{(n-k)s^2}{\sigma^2} \sim \chi^2(n-k)$$

independently of $b_i$. Thus we can proceed directly to form a t variable, that is, 

$$t = \frac{b_i - \beta_i}{\sigma\sqrt{a_{ii}}} \cdot \frac{\sigma\sqrt{n-k}}{s\sqrt{n-k}}$$

or 

$$t = \frac{b_i - \beta_i}{s\sqrt{a_{ii}}} \sim t(n-k) \qquad \text{for } i = 1, 2, \ldots, k \tag{5-61}$$

Result (5-61) may be used to test an hypothesis about $\beta_i$ or set up a confidence 
interval for $\beta_i$ in the usual way. However, we will not pursue the details further at 
the moment as it is more efficient to develop a general set of inference procedures, 
of which tests on a single coefficient are just one particular application. 


Sets of Linear Hypotheses 


Consider the set of linear hypotheses about the elements of $\boldsymbol{\beta}$, embodied in 

$$\mathbf{R}\boldsymbol{\beta} = \mathbf{r} \tag{5-62}$$

where $\mathbf{R}$ is a known matrix of order q × k with q < k, and $\mathbf{r}$ is a known q-element 
vector. We also assume $\mathbf{R}$ to have full row rank, that is, that there are no linear 
dependencies between the hypotheses. It is extremely important to understand the 
range of various hypotheses represented by Eq. (5-62). We illustrate them with 
some examples. 


1. $\mathbf{R} = [0 \;\; \cdots \;\; 0 \;\; 1 \;\; 0 \;\; \cdots \;\; 0]$ and $r = 0$ 
Here $\mathbf{R}$ contains only a single row (q = 1) with a unit in the ith position and 
zeros everywhere else, and r is the scalar zero. On substitution in Eq. (5-62) 
we have 

$$\beta_i = 0$$

Thus this specification of $\mathbf{R}$ and r sets up the hypothesis that $\beta_i$ is zero. 
Choosing a nonzero value for r would set up the hypothesis that $\beta_i$ is equal to 
the specified constant. 

2. $\mathbf{R} = [0 \;\; 1 \;\; -1 \;\; 0 \;\; \cdots \;\; 0]$ and $r = 0$ 
produce the hypothesis 

$$\beta_2 - \beta_3 = 0 \quad\text{or}\quad \beta_2 = \beta_3$$

3. $\mathbf{R} = [0 \;\; 0 \;\; 1 \;\; 1 \;\; 0 \;\; \cdots \;\; 0]$ and $r = 1$ 
specify the hypothesis 

$$\beta_3 + \beta_4 = 1$$

4. 

$$\mathbf{R} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}$$

of order (k − 1) × k and 

$$\mathbf{r} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

of order (k − 1) × 1. This is equivalent to the joint hypothesis 

$$\begin{bmatrix} \beta_2 \\ \beta_3 \\ \vdots \\ \beta_k \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

that is, that the set of explanatory variables $X_2, X_3, \ldots, X_k$ has no influence in 
the determination of Y. This is a very important hypothesis. The test of this 
hypothesis is often referred to as a test of the overall relation. Notice that the 
hypothesis does not include $\beta_1 = 0$, since that involves the additional implication 
that the mean level of Y is zero. Our usual concern is whether the 
hypothetical explanatory variables help to explain the variation of Y around 
its mean value, but the actual level of the mean is of no particular importance. 

5. $\mathbf{R} = [\mathbf{0} \quad \mathbf{I}_s]$ and $\mathbf{r} = \mathbf{0}$ 
Here $\mathbf{0}$ is a null matrix of order s × (k − s) and $\mathbf{r}$ is an s-element column 
vector. This sets up the hypothesis that the last s elements in $\boldsymbol{\beta}$ are jointly 
zero, 

$$\beta_{k-s+1} = \beta_{k-s+2} = \cdots = \beta_k = 0$$


For example, in an equation explaining the rate of inflation the explanatory 
variables might be grouped into two subsets—those measuring expectations 
of inflation and those measuring pressure of demand. The significance of 
either subset might be tested by using this formulation with the numbering of 
the variables so arranged that those in the subset to be tested come at the end. 


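It is thus clear that a procedure for testing the general hypothesis $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ will 
be extremely useful and powerful, since various specifications for $\mathbf{R}$ and $\mathbf{r}$ will 
cover a range of questions. 

To develop such a test procedure, we first of all replace the unknown $\boldsymbol{\beta}$ 
vector in Eq. (5-62) by the OLS vector $\mathbf{b}$, obtaining the vector $\mathbf{R}\mathbf{b}$. The more this 
vector departs from $\mathbf{r}$, the greater is the doubt cast on the hypothesis. The 
problem is to determine the sampling distribution of $\mathbf{R}\mathbf{b}$ and devise a practical 
test procedure. First of all, we see directly that 

$$E(\mathbf{R}\mathbf{b}) = \mathbf{R}\boldsymbol{\beta} \tag{5-63}$$

and 

$$\operatorname{var}(\mathbf{R}\mathbf{b}) = E\{\mathbf{R}(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'\mathbf{R}'\} = \sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' \tag{5-64}$$

Since $\mathbf{b}$ is multivariate normal, 

$$\mathbf{R}\mathbf{b} \sim N(\mathbf{R}\boldsymbol{\beta}, \sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}')$$

or 

$$\mathbf{R}(\mathbf{b} - \boldsymbol{\beta}) \sim N(\mathbf{0}, \sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}') \tag{5-65}$$

If the hypothesis (5-62) is true, we can replace $\mathbf{R}\boldsymbol{\beta}$ in Eq. (5-65) by $\mathbf{r}$, obtaining 

$$(\mathbf{R}\mathbf{b} - \mathbf{r}) \sim N(\mathbf{0}, \sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}') \tag{5-66}$$

We can now apply Eq. (5-11) directly to Eq. (5-66) and write 

$$(\mathbf{R}\mathbf{b} - \mathbf{r})'[\sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r}) \sim \chi^2(q) \tag{5-67}$$

where the degrees of freedom q are given by the number of elements in the $\mathbf{R}\mathbf{b}$ 
vector.† 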

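The only problem hindering practical applications of Eq. (5-67) is the 
presence of the unknown $\sigma^2$, since all other elements are known. However, we 
have already shown that 

$$\frac{\mathbf{e}'\mathbf{e}}{\sigma^2} \sim \chi^2(n-k)$$

independently of $\mathbf{b}$, and hence independently of $\mathbf{R}\mathbf{b}$. Thus we can form an F ratio, 
and the unknown $\sigma^2$ will cancel out. The basic result is thus, if $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ is true, 

$$\frac{(\mathbf{R}\mathbf{b} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r})/q}{\mathbf{e}'\mathbf{e}/(n-k)} \sim F(q, n-k) \tag{5-68}$$

The test procedure is then to reject the hypothesis $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ if the computed F value 
exceeds a preselected critical value. Now we must see what this test procedure 
amounts to in some of the specific applications indicated above. 

† To show that the inverse of $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ exists, we show that 

$$\mathbf{z}'\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\mathbf{z} > 0$$

for $\mathbf{z} \neq \mathbf{0}$, so that the matrix is positive definite and thus nonsingular. Define $\mathbf{v} = \mathbf{R}'\mathbf{z}$. Then $\mathbf{v}$ is a 
linear combination of the rows of $\mathbf{R}$. Since $\mathbf{R}$ has full row rank, $\mathbf{v} \neq \mathbf{0}$. Thus 

$$\mathbf{z}'\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\mathbf{z} = \mathbf{v}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{v}$$

But $(\mathbf{X}'\mathbf{X})$ is positive definite by assumption, and so $(\mathbf{X}'\mathbf{X})^{-1}$ is positive definite, since its eigenvalues 
are the reciprocals of the eigenvalues of $(\mathbf{X}'\mathbf{X})$. Thus 

$$\mathbf{v}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{v} > 0$$

and so $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ is positive definite. 


Testing a Single Coefficient 

$$\mathbf{R} = [0 \;\; \cdots \;\; 0 \;\; 1 \;\; 0 \;\; \cdots \;\; 0] \quad\text{(unit in the ith position)} \quad\text{and}\quad r = 0$$

sets up the hypothesis 

$$H_0\colon \beta_i = 0$$

We then have 

$$\mathbf{R}\mathbf{b} - r = b_i$$

and $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ 
merely picks out the ith element $a_{ii}$ on the main diagonal of $(\mathbf{X}'\mathbf{X})^{-1}$. Thus the 
test statistic becomes 

$$F = \frac{b_i^2}{s^2a_{ii}} \sim F(1, n-k) \tag{5-69}$$

If instead of testing the hypothesis 

$$\beta_i = 0$$

one wishes to test the hypothesis that $\beta_i$ assumed some specified value, 

$$\beta_i = \beta_{i0}$$

we simply set $r = \beta_{i0}$, and the test statistic becomes 

$$F = \frac{(b_i - \beta_{i0})^2}{s^2a_{ii}}$$

This is, of course, the same result as that already derived by a different route in 
Eq. (5-61), since $t^2(n-k) = F(1, n-k)$. 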
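
The general statistic (5-68) translates almost line for line into code. The following is a minimal sketch assuming NumPy; the function name `f_statistic` is hypothetical, and the data are those implied by Example 5-1. It computes F for the single-coefficient hypothesis β3 = 0 and confirms that it equals the square of the t ratio of Eq. (5-61).

```python
import numpy as np


def f_statistic(X, y, R, r):
    """F statistic of Eq. (5-68) for the hypothesis R beta = r."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    R = np.atleast_2d(np.asarray(R, float))
    r = np.atleast_1d(np.asarray(r, float))
    n, k = X.shape
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = (e @ e) / (n - k)
    d = R @ b - r
    return float(d @ np.linalg.solve(R @ XtX_inv @ R.T, d) / q / s2)


y = np.array([3.0, 1.0, 8.0, 3.0, 5.0])
X = np.array([[1, 3, 5], [1, 1, 4], [1, 5, 6], [1, 2, 4], [1, 4, 6]], dtype=float)

# Single-coefficient test of beta_3 = 0, two ways.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s2 = ((y - X @ b) @ (y - X @ b)) / (X.shape[0] - X.shape[1])
t_b3 = b[2] / np.sqrt(s2 * XtX_inv[2, 2])
F_b3 = f_statistic(X, y, [0, 0, 1], 0)
```

The same function handles any full-row-rank R, for example `f_statistic(X, y, [0, 1, 1], 0)` for the hypothesis that two slopes sum to zero.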


Testing the Significance of the Complete Regression 


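$$\mathbf{R} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}$$

of order (k − 1) × k and $\mathbf{r} = \mathbf{0}$. The hypothesis now is 

$$\beta_2 = \beta_3 = \cdots = \beta_k = 0$$

The vector $\mathbf{R}\mathbf{b} - \mathbf{r}$ now reduces simply to the k − 1 vector of OLS regression 
slopes, $\mathbf{b}_2$. 
$\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ picks out the (k − 1) × (k − 1) submatrix formed by the last k − 1 
rows and columns in $(\mathbf{X}'\mathbf{X})^{-1}$. To see what is implied by this matrix, partition the 
$\mathbf{X}$ matrix as 

$$\mathbf{X} = [\mathbf{i} \quad \mathbf{X}_2]$$

where $\mathbf{i}$ is a column of units and $\mathbf{X}_2$ is the n × (k − 1) matrix of observations on 
all the explanatory variables. Then 

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} n & \mathbf{i}'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{i} & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}$$

By Eq. (4-66) the right lower k − 1 submatrix in $(\mathbf{X}'\mathbf{X})^{-1}$ is 

$$\left(\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{i}\,\frac{1}{n}\,\mathbf{i}'\mathbf{X}_2\right)^{-1} = (\mathbf{X}_2'\mathbf{A}\mathbf{X}_2)^{-1}$$

from Eq. (5-41). Thus the F statistic becomes 

$$F = \frac{\mathbf{b}_2'(\mathbf{X}_2'\mathbf{A}\mathbf{X}_2)\mathbf{b}_2/(k-1)}{\mathbf{e}'\mathbf{e}/(n-k)} \tag{5-70}$$

From the decomposition of the total sum of squares in Eq. (5-49) above this is 
seen to be 

$$F = \frac{\text{ESS}/(k-1)}{\text{RSS}/(n-k)} \tag{5-71}$$

or, in terms of $R^2$, 

$$F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} \tag{5-72}$$

The joint significance of the complete set of explanatory variables is thus tested by 
computing F from any of these three formulas and seeing whether the computed 
value exceeds the preselected critical value. 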
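
The equivalence of the three formulas can be illustrated with the quantities already obtained in Example 5-1 (slopes 2.5 and −1.5, deviation cross-product matrix with rows 10, 6 and 6, 4, and RSS = 1.5). This is a sketch assuming NumPy; the variable names are illustrative.

```python
import numpy as np

n, k = 5, 3
b2 = np.array([2.5, -1.5])               # OLS slopes from Example 5-1
S = np.array([[10.0, 6.0], [6.0, 4.0]])  # X2'AX2 from Example 5-1
rss = 1.5                                # e'e from Example 5-1

ess = b2 @ S @ b2                        # explained sum of squares
r2 = ess / (ess + rss)

F_quadratic = (b2 @ S @ b2 / (k - 1)) / (rss / (n - k))  # Eq. (5-70)
F_sums = (ess / (k - 1)) / (rss / (n - k))               # Eq. (5-71)
F_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))             # Eq. (5-72)
```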


Testing the Significance of a Subset of Coefficients 
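Specifying 

$$\mathbf{R} = [\mathbf{0} \quad \mathbf{I}_s] \quad\text{and}\quad \mathbf{r} = \mathbf{0}$$

sets up the hypothesis 

$$\beta_{k-s+1} = \beta_{k-s+2} = \cdots = \beta_k = 0$$

We can always renumber the variables, if necessary, so that the subset of interest 
comes at the end. Let us partition $\mathbf{X}$ and $\mathbf{b}$ conformably so that the complete OLS 
regression may be written 

$$\mathbf{y} = [\mathbf{X}_1 \quad \mathbf{X}_2]\begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{bmatrix} + \mathbf{e} = \mathbf{X}_1\mathbf{b}_1 + \mathbf{X}_2\mathbf{b}_2 + \mathbf{e} \tag{5-73}$$

where $\mathbf{X}_1$ is of order n × (k − s) and denotes the first k − s columns in $\mathbf{X}$, and $\mathbf{X}_2$ 
denotes the last s columns in $\mathbf{X}$. Now 

$$\mathbf{R}\mathbf{b} - \mathbf{r} = \mathbf{b}_2$$

and $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ picks out the square submatrix of order s in the bottom 
right-hand corner of $(\mathbf{X}'\mathbf{X})^{-1}$. Let us call that submatrix $\mathbf{C}_{22}$. From the partitioning 
of $\mathbf{X}$ above 

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}$$

and from Eq. (4-66) 

$$\begin{aligned} \mathbf{C}_{22} &= \left(\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\right)^{-1} \\ &= \left(\mathbf{X}_2'[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{X}_2\right)^{-1} \\ &= (\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2)^{-1} \end{aligned} \tag{5-74}$$

where 

$$\mathbf{M}_1 = \mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1' \tag{5-75}$$

Thus the numerator in the test statistic, Eq. (5-68), becomes† 

$$\mathbf{b}_2'(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2)\mathbf{b}_2/s$$

† Do not confuse this s, which denotes the number of coefficients under test, with the square root of 
the residual variance defined in Eq. (5-57). 

We will now show that this numerator has a very fundamental and important 
interpretation in terms of sums of squares. Suppose $\mathbf{y}$ is regressed just on the 
subset of variables in $\mathbf{X}_1$. Let $\mathbf{e}_1$ denote the resultant vector of residuals. From Eq. 
(5-54) we have 

$$\mathbf{e}_1 = \mathbf{M}_1\mathbf{y}$$

where $\mathbf{M}_1$ is exactly the matrix just defined in Eq. (5-75). 
Thus 

$\mathbf{e}_1'\mathbf{e}_1$ = residual sum of squares from regression of $\mathbf{y}$ on $\mathbf{X}_1$ 
$\mathbf{e}'\mathbf{e}$ = residual sum of squares from regression of $\mathbf{y}$ on $[\mathbf{X}_1 \;\; \mathbf{X}_2]$ 
$\mathbf{e}_1'\mathbf{e}_1 - \mathbf{e}'\mathbf{e}$ = reduction in residual sum of squares due to adding $\mathbf{X}_2$ to regression 
= increase in explained sum of squares due to adding $\mathbf{X}_2$ to regression 

Our purpose is to show that 

$$\mathbf{b}_2'(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2)\mathbf{b}_2 = \mathbf{e}_1'\mathbf{e}_1 - \mathbf{e}'\mathbf{e}$$

Return to Eq. (5-73) and premultiply by $\mathbf{M}_1$ to get 

$$\begin{aligned} \mathbf{M}_1\mathbf{y} &= \mathbf{M}_1\mathbf{X}_1\mathbf{b}_1 + \mathbf{M}_1\mathbf{X}_2\mathbf{b}_2 + \mathbf{M}_1\mathbf{e} \\ &= \mathbf{M}_1\mathbf{X}_2\mathbf{b}_2 + \mathbf{e} \end{aligned}$$

for the definition of $\mathbf{M}_1$ in Eq. (5-75) implies $\mathbf{M}_1\mathbf{X}_1 = \mathbf{0}$ and $\mathbf{M}_1\mathbf{e} = \mathbf{e}$ since 
$\mathbf{X}'\mathbf{e} = \begin{bmatrix} \mathbf{X}_1'\mathbf{e} \\ \mathbf{X}_2'\mathbf{e} \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}$. Transposing and squaring this equation gives 

$$\mathbf{y}'\mathbf{M}_1\mathbf{y} = \mathbf{b}_2'\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\mathbf{b}_2 + \mathbf{e}'\mathbf{e}$$

but 

$$\mathbf{y}'\mathbf{M}_1\mathbf{y} = \mathbf{e}_1'\mathbf{e}_1$$

and the desired result follows. Thus the test statistic, Eq. (5-68), in this case 
becomes 

$$F = \frac{(\mathbf{e}_1'\mathbf{e}_1 - \mathbf{e}'\mathbf{e})/s}{\mathbf{e}'\mathbf{e}/(n-k)} \sim F(s, n-k) \tag{5-76}$$

In words, the test of the joint significance of the subset $\mathbf{X}_2$ is achieved by the 
following steps: 


1. Regress $\mathbf{y}$ on the variables $\mathbf{X}_1$ which are not in the subset, and measure the 
residual sum of squares $\mathbf{e}_1'\mathbf{e}_1$. 

2. Carry out the complete regression and measure the residual sum of squares 
$\mathbf{e}'\mathbf{e}$. The difference $\mathbf{e}_1'\mathbf{e}_1 - \mathbf{e}'\mathbf{e}$ measures the reduction in the residual sum of 
squares due to adding $\mathbf{X}_2$ to the regression. 

3. The mean square $(\mathbf{e}_1'\mathbf{e}_1 - \mathbf{e}'\mathbf{e})/s$, associated with the subset, is then contrasted 
with the overall mean square $\mathbf{e}'\mathbf{e}/(n-k)$. If the resultant F value exceeds a 
preselected critical value, the hypothesis that the variables in $\mathbf{X}_2$ have zero 
effect on Y is rejected. 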
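
The three steps above can be sketched as follows, assuming NumPy. The helper names `rss` and `subset_f` are hypothetical, and the data are those implied by Example 5-1, with the subset under test taken to be the single variable in the last column.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from the OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)


def subset_f(X1, X2, y):
    """F statistic of Eq. (5-76) for the hypothesis that X2 has zero effect."""
    n = len(y)
    s = X2.shape[1]
    k = X1.shape[1] + s
    rss_restricted = rss(X1, y)                   # step 1: y on X1 only
    rss_full = rss(np.column_stack([X1, X2]), y)  # step 2: complete regression
    return ((rss_restricted - rss_full) / s) / (rss_full / (n - k))  # step 3


y = np.array([3.0, 1.0, 8.0, 3.0, 5.0])
X = np.array([[1, 3, 5], [1, 1, 4], [1, 5, 6], [1, 2, 4], [1, 4, 6]], dtype=float)
X1, X2 = X[:, :2], X[:, 2:]

F = subset_f(X1, X2, y)
```

Using `lstsq` rather than an explicit inverse is a design choice made here for numerical convenience; it yields the same residuals as solving the normal equations when X has full column rank.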


The previous test for the joint significance of all the explanatory variables 
may also be seen to be of the same form as Eq. (5-76). That test was based on 

$$F = \frac{\text{ESS}/(k-1)}{\text{RSS}/(n-k)}$$

From Eq. (5-49) we have 

$$\underbrace{\mathbf{y}'\mathbf{A}\mathbf{y}}_{\text{(TSS)}} = \underbrace{\mathbf{b}_2'\mathbf{X}_2'\mathbf{A}\mathbf{X}_2\mathbf{b}_2}_{\text{(ESS)}} + \underbrace{\mathbf{e}'\mathbf{e}}_{\text{(RSS)}}$$

Thus the above F statistic could be written 

$$F = \frac{(\mathbf{y}'\mathbf{A}\mathbf{y} - \mathbf{e}'\mathbf{e})/(k-1)}{\mathbf{e}'\mathbf{e}/(n-k)}$$

and $\mathbf{y}'\mathbf{A}\mathbf{y}$, which is the sum of the squared deviations of the Y values, can be 
interpreted as a residual sum of squares when Y is regressed only on a vector of 
units $\mathbf{i}$, for replacing $\mathbf{X}_1$ in Eq. (5-75) by $\mathbf{i}$ gives 

$$\mathbf{M}_1 = \mathbf{I} - \frac{1}{n}\mathbf{i}\mathbf{i}'$$

This is the $\mathbf{A}$ matrix of Eq. (5-41), which transforms a variable into deviation 
form. Thus $\mathbf{e}_1'\mathbf{e}_1$ becomes $\mathbf{y}'\mathbf{A}\mathbf{y}$ in this case. 

The test of a single coefficient is merely a special case of the test of a subset. 
Thus the t or F test for the significance of a single coefficient may also be 
interpreted in a sums of squares context. The test of 

$$H_0\colon \beta_i = 0$$

amounts to 

1. Regress $\mathbf{y}$ on all X's except $X_i$. 
2. Regress $\mathbf{y}$ on all X's. 
3. Compute the reduction in the residual sum of squares from step 1 to step 2 
and contrast with $\mathbf{e}'\mathbf{e}/(n-k)$. 

Finally, we note another illuminating way of interpreting these various tests. 
The regression of $\mathbf{y}$ on $\mathbf{X}_1$ leading to the residual vector $\mathbf{e}_1$ may be regarded as a 
restricted regression. The essence of the restriction is that any coefficients which 
are specified by the hypothesis to be zero in the population are actually set at zero 
in the sample. Thus in testing $\beta_i = 0$, the restricted regression omits $X_i$, so that in 
effect $b_i = 0$. Likewise, in testing the overall regression, the restricted regression 
leaves out all variables except the unit vector, thus setting $b_2 = b_3 = \cdots = b_k = 0$. 
The complete regression may be regarded as an unrestricted regression, since all 
the variables are included, and the estimated coefficients come out as the sample 
data determine. Thus 


$\mathbf{e}_1'\mathbf{e}_1$ = residual sum of squares from restricted regression 
$\mathbf{e}'\mathbf{e}$ = residual sum of squares from unrestricted regression 

and the test of the significance of a restriction, or the set of q restrictions, is 

$$F = \frac{(\mathbf{e}_1'\mathbf{e}_1 - \mathbf{e}'\mathbf{e})/q}{\mathbf{e}'\mathbf{e}/(n-k)} \tag{5-77}$$


Confidence Intervals 


Confidence intervals for a single $\beta$ coefficient may be readily determined from the 
result on the t distribution in Eq. (5-61). Joint confidence regions for two or more 
parameters may also be determined. From Eq. (5-65) we have 

$$[\mathbf{R}(\mathbf{b} - \boldsymbol{\beta})]'[\sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}[\mathbf{R}(\mathbf{b} - \boldsymbol{\beta})] \sim \chi^2(q)$$

and, as usual, 

$$\frac{\mathbf{e}'\mathbf{e}}{\sigma^2} \sim \chi^2(n-k)$$

independently of $\mathbf{b}$. Thus 

$$F = \frac{[\mathbf{R}(\mathbf{b} - \boldsymbol{\beta})]'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}[\mathbf{R}(\mathbf{b} - \boldsymbol{\beta})]/q}{\mathbf{e}'\mathbf{e}/(n-k)} \sim F(q, n-k) \tag{5-78}$$

Appropriate specifications of $\mathbf{R}$ in Eq. (5-78) will yield confidence regions for 
various groups of parameters. For example, setting $\mathbf{R} = \mathbf{I}_k$ and equating the 
expression in Eq. (5-78) to some critical value of F gives a condition on the 
unknown $\boldsymbol{\beta}$ vector from which a joint confidence region may be determined. 


Example 5-2. Example 5-1 was based on 

$$\mathbf{y} = \begin{bmatrix} 3 \\ 1 \\ 8 \\ 3 \\ 5 \end{bmatrix} \quad\text{and}\quad \mathbf{X} = \begin{bmatrix} 1 & 3 & 5 \\ 1 & 1 & 4 \\ 1 & 5 & 6 \\ 1 & 2 & 4 \\ 1 & 4 & 6 \end{bmatrix}$$

The estimated regression was 

$$\hat{Y} = 4 + 2.5X_2 - 1.5X_3$$

with ESS = 26.5, RSS = 1.5, TSS = 28, and 

$$R^2 = 0.9464$$

We also have n = 5 and k = 3. We will now illustrate tests of various hypotheses 
with these data. It must be emphasized that the tests are simply meant to 
illustrate the use of the formulas of this section. The data have been “cooked” to 
give simple numbers, and the sample size is too small to allow any sharp 
interpretations. 


1. Testing the joint significance of $X_2$ and $X_3$ 

Substitution in Eq. (5-71) gives 

$$F = \frac{\text{ESS}/(k-1)}{\text{RSS}/(n-k)} = \frac{26.5/(3-1)}{1.5/(5-3)} = 17.67$$

From the tables of the F distribution, $F_{0.05}(2, 2) = 19.00$, so that the sample F 
falls short of the 5 percent critical value. Even though the sample $R^2$ is 
numerically high, the sample size is so small that it fails to reach significance. 

2. Testing the significance of $X_3$ 

Result (5-61) states that 

$$t = \frac{b_i - \beta_i}{s\sqrt{a_{ii}}} \sim t(n-k)$$

where $a_{ii}$ is the ith term on the main diagonal of $(\mathbf{X}'\mathbf{X})^{-1}$. We do not need, 
however, to invert the 3 × 3 matrix $\mathbf{X}'\mathbf{X}$. In the development of Eq. (5-70) we 
showed that the right lower k − 1 submatrix in $(\mathbf{X}'\mathbf{X})^{-1}$ is given by 
$(\mathbf{X}_2'\mathbf{A}\mathbf{X}_2)^{-1}$, which is simply the inverse of the matrix of sums of squares and 
products of the variables in deviation form. For this example we have 

$$\mathbf{X}_2'\mathbf{A}\mathbf{X}_2 = \begin{bmatrix} 10 & 6 \\ 6 & 4 \end{bmatrix}$$

Thus 

$$(\mathbf{X}_2'\mathbf{A}\mathbf{X}_2)^{-1} = \begin{bmatrix} 1 & -1.5 \\ -1.5 & 2.5 \end{bmatrix}$$

giving $a_{33} = 2.5$. Further, $s^2 = \mathbf{e}'\mathbf{e}/(n-k) = 1.5/2 = 0.75$. Finally, 
substituting −1.5 for $b_3$ and 0 for $\beta_3$ gives the test statistic 

$$t = \frac{-1.5}{\sqrt{0.75}\sqrt{2.5}} = -1.095$$

which is insignificant. 


Alternatively, we may show that the same numerical value for the test statistic 
comes from the stepwise reduction in the residual sum of squares. It is again 
simpler to work with the data in deviation form. Regressing Y on $X_2$ gives an 
estimated regression coefficient of 

$$b = \frac{\sum yx_2}{\sum x_2^2} = \frac{16}{10} = 1.6$$

The explained sum of squares due to this regression is then 

$$b\sum yx_2 = 1.6(16) = 25.6$$

and the residual sum of squares is 

$$\sum y^2 - b\sum yx_2 = 28.0 - 25.6 = 2.4 = \mathbf{e}_1'\mathbf{e}_1$$

When the complete regression on $X_2$ and $X_3$ is run, we already know the residual 
sum of squares, 

$$\text{RSS} = \mathbf{e}'\mathbf{e} = 1.5$$

Thus substitution in Eq. (5-77) gives 

$$F = \frac{(\mathbf{e}_1'\mathbf{e}_1 - \mathbf{e}'\mathbf{e})/q}{\mathbf{e}'\mathbf{e}/(n-k)} = \frac{(2.4 - 1.5)/1}{1.5/2} = 1.2$$

which is the square of the above t statistic, as it should be. 
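3. Coefficients of $X_2$ and $X_3$ equal in magnitude but opposite in sign 

$$H_0\colon \beta_2 + \beta_3 = 0$$

From the general formulation 

$$\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$$

this gives 

$$\mathbf{R} = [0 \quad 1 \quad 1] \quad\text{and}\quad r = 0$$

with q = 1. The appropriate test statistic is given by the general result in Eq. 
(5-68), namely, 

$$F = \frac{(\mathbf{R}\mathbf{b} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r})/q}{\mathbf{e}'\mathbf{e}/(n-k)}$$

We then have 

$$\mathbf{R}\mathbf{b} - r = b_2 + b_3$$

and $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ only involves the elements in the 2 × 2 submatrix in the 
lower right-hand corner of $(\mathbf{X}'\mathbf{X})^{-1}$. As we have already seen, this is 
$(\mathbf{X}_2'\mathbf{A}\mathbf{X}_2)^{-1}$. Thus 

$$\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' = [1 \quad 1]\begin{bmatrix} 1 & -1.5 \\ -1.5 & 2.5 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = 0.5$$

Thus the test statistic becomes 

$$F = \frac{(b_2 + b_3)^2}{0.75(0.5)} = 2.67$$

which falls well short of any usual critical value for F(1, 2). Thus the data are 
not inconsistent with the hypothesis that $\beta_2 + \beta_3 = 0$. 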


4. 95 percent confidence interval for $\beta_2$ 

$$b_2 = 2.5$$

$$s^2 = \frac{1.5}{2} = 0.75$$

The top diagonal element in $(\mathbf{X}_2'\mathbf{A}\mathbf{X}_2)^{-1}$ is 1. Thus 

$$\operatorname{var}(b_2) = 0.75$$

$$\text{s.e.}(b_2) = 0.866$$

$$t_{0.025}(2) = 4.303$$

Thus the confidence interval is given by 

$$2.5 \pm 4.303(0.866) = 2.5 \pm 3.7$$

that is, 

−1.2 to 6.2 

The fact that the confidence interval includes zero means that $b_2$ is not 
statistically significant at the 5 percent level. 
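
The interval can be reproduced with a few lines of arithmetic. In this sketch the critical value 4.303 is simply taken from the tables as quoted above rather than computed, and the variable names are illustrative.

```python
import math

b2 = 2.5        # OLS estimate of beta_2
s2 = 0.75       # estimated disturbance variance
a22 = 1.0       # top diagonal element of (X2'AX2)^{-1}
t_crit = 4.303  # t_{0.025}(2), taken from tables as quoted in the text

se_b2 = math.sqrt(s2 * a22)
lower = b2 - t_crit * se_b2
upper = b2 + t_crit * se_b2
contains_zero = lower < 0.0 < upper
```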


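5. Joint confidence region for $\beta_2$ and $\beta_3$ Returning to Eq. (5-78) we specify 

$$\mathbf{R} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

with q = 2. Thus 

$$\mathbf{R}(\mathbf{b} - \boldsymbol{\beta}) = \begin{bmatrix} b_2 \\ b_3 \end{bmatrix} - \begin{bmatrix} \beta_2 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 2.5 - \beta_2 \\ -1.5 - \beta_3 \end{bmatrix}$$

and 

$$[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1} = \begin{bmatrix} 10 & 6 \\ 6 & 4 \end{bmatrix}$$

Substitution in Eq. (5-78) gives 

$$\begin{aligned} F &= \frac{[2.5 - \beta_2 \quad -1.5 - \beta_3]\begin{bmatrix} 10 & 6 \\ 6 & 4 \end{bmatrix}\begin{bmatrix} 2.5 - \beta_2 \\ -1.5 - \beta_3 \end{bmatrix}}{1.5} \\ &= \frac{26.5 - 32\beta_2 - 18\beta_3 + 12\beta_2\beta_3 + 10\beta_2^2 + 4\beta_3^2}{1.5} \end{aligned}$$

Choosing, say, the 5 percent critical value of F, we have 

$$\Pr(F < F_{0.05}) = 0.95$$

Then setting 

$$F = F_{0.05}$$

defines a 95 percent confidence ellipse for the unknown $\beta$ parameters in F. 
For this problem $F_{0.05}(2, 2) = 19$. Setting $F = F_{0.05}$ then gives 

$$\frac{10\beta_2^2 + 12\beta_2\beta_3 + 4\beta_3^2 - 32\beta_2 - 18\beta_3 + 26.5}{1.5} = 19.0$$

that is, 

$$10\beta_2^2 + 12\beta_2\beta_3 + 4\beta_3^2 - 32\beta_2 - 18\beta_3 - 2 = 0$$

This defines the 95 percent confidence ellipse for $\beta_2$ and $\beta_3$, which is sketched 
in Fig. 5-1. The ellipse is centered at the estimated point $b_2 = 2.5$, $b_3 = -1.5$. 
There is a strong negative covariance between the two estimates and the 
origin lies just inside the ellipse, in agreement with the result of test 1 above. 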
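
Membership in the confidence ellipse can be checked point by point. The sketch below assumes NumPy; the function names `f_value`, `inside_ellipse`, and `ellipse_poly` are hypothetical, and the numerical inputs are those of the example.

```python
import numpy as np

b = np.array([2.5, -1.5])                # estimated slopes (center of the ellipse)
S = np.array([[10.0, 6.0], [6.0, 4.0]])  # [R (X'X)^{-1} R']^{-1} = X2'AX2
q, s2 = 2, 0.75
F_crit = 19.0                            # F_{0.05}(2, 2), as quoted in the text


def f_value(beta2, beta3):
    """Value of the F expression (5-78) at a candidate point (beta2, beta3)."""
    d = b - np.array([beta2, beta3])
    return float(d @ S @ d / q / s2)


def inside_ellipse(beta2, beta3):
    return f_value(beta2, beta3) <= F_crit


def ellipse_poly(beta2, beta3):
    """Left-hand side of the ellipse equation; it is zero on the boundary."""
    return (10 * beta2**2 + 12 * beta2 * beta3 + 4 * beta3**2
            - 32 * beta2 - 18 * beta3 - 2)


F_at_origin = f_value(0.0, 0.0)
origin_inside = inside_ellipse(0.0, 0.0)
```

At the origin the F expression reproduces the overall-significance statistic of test 1, which is why the origin lying inside the ellipse agrees with the failure to reject the joint hypothesis.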


Figure 5-1 95 percent confidence ellipse for $b_2$, $b_3$. 


6. Point and interval forecasts Suppose we wish to forecast the value of Y 
associated with $X_2 = 10$ and $X_3 = 10$. Plugging these values into the regression 
equation gives the point forecast 

$$\hat{Y}_f = 4 + 2.5(10) - 1.5(10) = 14$$

A point forecast is of little use unless supplemented by a measure of 
precision, which enables us to put the forecast in interval form. The forecast 
may be written 

$$\hat{Y}_f = [1 \quad 10 \quad 10]\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \mathbf{R}\mathbf{b}$$

where $\mathbf{R} = [1 \quad 10 \quad 10]$. 
The actual Y value in the forecast period will be 

$$Y_f = \mathbf{R}\boldsymbol{\beta} + u_f$$

where $u_f$ denotes the actual value assumed by the disturbance in the forecast 
period. Let us then define the forecast error $e_f$ as 

$$e_f = Y_f - \hat{Y}_f = -\mathbf{R}(\mathbf{b} - \boldsymbol{\beta}) + u_f$$

It follows immediately that 

$$E(e_f) = 0$$

since $E(\mathbf{b}) = \boldsymbol{\beta}$ and $E(u_f) = 0$. Also 

$$\begin{aligned} \operatorname{var}(e_f) &= E\{[-\mathbf{R}(\mathbf{b} - \boldsymbol{\beta}) + u_f][-\mathbf{R}(\mathbf{b} - \boldsymbol{\beta}) + u_f]'\} \\ &= \sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' + \sigma^2 \end{aligned}$$

using Eq. (5-64), and also the fact that $u_f$ will be independent of the sample 
disturbances and hence independent of $\mathbf{b}$. Thus 

$$\frac{Y_f - \hat{Y}_f}{\sigma\sqrt{1 + \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'}} \sim N(0, 1)$$

Replacing the unknown $\sigma$ by 

$$s = \sqrt{\mathbf{e}'\mathbf{e}/(n-k)}$$

then gives 

$$\frac{Y_f - \hat{Y}_f}{s\sqrt{1 + \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'}} \sim t(n-k)$$

and so a 95 percent confidence interval for $Y_f$ is 

$$\hat{Y}_f \pm t_{0.025}\,s\sqrt{1 + \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'} \tag{5-79}$$

where $\mathbf{R}$ is a row vector containing the values of X in the forecast period 
prefaced by unity in the first position. In this example 


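$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} 5 & 15 & 25 \\ 15 & 55 & 81 \\ 25 & 81 & 129 \end{bmatrix}$$

with 

$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix} 26.7 & 4.5 & -8.0 \\ 4.5 & 1.0 & -1.5 \\ -8.0 & -1.5 & 2.5 \end{bmatrix}$$

Thus 

$$\begin{aligned} \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' &= [1 \quad 10 \quad 10]\begin{bmatrix} 26.7 & 4.5 & -8.0 \\ 4.5 & 1.0 & -1.5 \\ -8.0 & -1.5 & 2.5 \end{bmatrix}\begin{bmatrix} 1 \\ 10 \\ 10 \end{bmatrix} \\ &= [-8.3 \quad -0.5 \quad 2.0]\begin{bmatrix} 1 \\ 10 \\ 10 \end{bmatrix} = 6.7 \end{aligned}$$

We also have 

$$s^2 = 0.75$$

and 

$$t_{0.025}(2) = 4.303$$

Thus substitution in Eq. (5-79) gives 

$$14 \pm 4.303\sqrt{0.75}\sqrt{7.7} = 14 \pm 10.34$$

or 

3.66 to 24.34 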

This is a prediction interval for ¥,, the value of Y in the forecast period. 
Sometimes an investigator prefers to set up an interval for E( Y,), that is, the 
mean or expected value of Y in the forecast period, the reason being that Y; 
contains the disturbance u,, which is essentially unpredictable. We have 


= RB + uy 
Thus E(¥,) = RB 


and the forecast error would now be defined as 
¢,= E(%) ~ ¥,= =R(b - B) 


Following through the steps of the previous analysis then gives a 95 percent
confidence interval for $E(Y_f)$ as

$$\hat{Y}_f \pm t_{0.025}\, s\sqrt{R(X'X)^{-1}R'} \tag{5-80}$$

The numerical implementation of Eq. (5-80) gives

$$14 \pm 4.303\sqrt{0.75}\,\sqrt{6.7}$$

or

$$4.36 \text{ to } 23.64$$

which is a slightly narrower interval than that for $Y_f$.
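The two interval calculations can be reproduced numerically. The following is a minimal sketch in Python with NumPy (an assumption of this illustration, not part of the original text); it takes $X'X$, $R$, $s^2$, the point forecast, and the tabulated $t$ value from the example as given and evaluates the half-widths in Eqs. (5-79) and (5-80). The variable names are illustrative only.

```python
import numpy as np

# Quantities quoted in the worked example (n = 5 observations, k = 3 coefficients).
XtX = np.array([[5.0, 15.0, 25.0],
                [15.0, 55.0, 81.0],
                [25.0, 81.0, 129.0]])
R = np.array([1.0, 10.0, 10.0])   # forecast-period X values, prefaced by unity
y_hat = 14.0                      # point forecast Rb
s2 = 0.75                         # s^2 = e'e / (n - k)
t_crit = 4.303                    # t_{0.025}(2), taken from tables

# Quadratic form R (X'X)^{-1} R', computed by solving rather than inverting.
q = R @ np.linalg.solve(XtX, R)

half_pred = t_crit * np.sqrt(s2 * (1.0 + q))   # half-width for Y_f, Eq. (5-79)
half_mean = t_crit * np.sqrt(s2 * q)           # half-width for E(Y_f), Eq. (5-80)

print("R(X'X)^-1 R' =", round(q, 4))
print("interval for Y_f:   ", y_hat - half_pred, y_hat + half_pred)
print("interval for E(Y_f):", y_hat - half_mean, y_hat + half_mean)
```

The only difference between the two intervals is the extra unity under the square root, which carries the variance of the forecast-period disturbance.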


There is an alternative way of generating either interval forecast, which is 
simpler in that it only requires the inversion of a second-order rather than a 
third-order matrix, and which also provides an illuminating way of looking at 
the OLS regression. From Eq. (5-48) we can write the OLS regression as 


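$$Y = \bar{Y} + b_2 x_2 + b_3 x_3 + \cdots + b_k x_k + e$$

where, as in Chaps. 2 and 3, $x_2 = X_2 - \bar{X}_2$, and so on. This is equivalent to
the regression of y on X, where X is now defined as

$$X = [\,i \quad AX_2\,]$$

i being a column of units and $AX_2$ the $n \times (k - 1)$ matrix of deviations. The
OLS estimator is then

$$b = \begin{bmatrix} n & 0 \\ 0 & X_2'AX_2 \end{bmatrix}^{-1} \begin{bmatrix} i'y \\ X_2'Ay \end{bmatrix} = \begin{bmatrix} 1/n & 0 \\ 0 & (X_2'AX_2)^{-1} \end{bmatrix} \begin{bmatrix} i'y \\ X_2'Ay \end{bmatrix} = \begin{bmatrix} \bar{Y} \\ (X_2'AX_2)^{-1}X_2'Ay \end{bmatrix}$$

where we used the result that $i'AX_2 = 0$ (the sums of sample deviations being
identically zero). The covariance matrix is

$$\operatorname{var}(b) = \sigma^2 \begin{bmatrix} 1/n & 0 \\ 0 & (X_2'AX_2)^{-1} \end{bmatrix}$$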
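The point forecast may be written

$$\hat{Y}_f = \bar{Y} + x_f b_2$$

where $x_f = [x_{2f} \ \cdots \ x_{kf}]$ is a row vector of the X deviations in the
forecast period and $b_2$ is a $(k - 1)$-element column vector of the OLS
regression slopes. Thus

$$E(\hat{Y}_f) = E(\bar{Y}) + x_f\beta_2$$

and

$$\operatorname{var}(\hat{Y}_f) = \operatorname{var}(\bar{Y}) + x_f E\{(b_2 - \beta_2)(b_2 - \beta_2)'\}x_f' = \sigma^2\left[\frac{1}{n} + x_f(X_2'AX_2)^{-1}x_f'\right]$$

since the matrix var(b) above shows that $\bar{Y}$ and $b_2$ are distributed
independently. For the problem in hand,

$$\bar{Y} = 4 \qquad \bar{X}_2 = 3 \qquad \bar{X}_3 = 5$$

and so

$$x_f = [7 \quad 5]$$

$$(X_2'AX_2)^{-1} = \begin{bmatrix} 1.0 & -1.5 \\ -1.5 & 2.5 \end{bmatrix}$$

and $s^2 = 0.75$. Thus the estimated $\operatorname{var}(\hat{Y}_f)$ is

$$0.75(0.2) + 0.75\,[7 \quad 5]\begin{bmatrix} 1.0 & -1.5 \\ -1.5 & 2.5 \end{bmatrix}\begin{bmatrix} 7 \\ 5 \end{bmatrix} = 0.75(6.7)$$

The point forecast is

$$\hat{Y}_f = 4 + [7 \quad 5]\begin{bmatrix} 2.5 \\ -1.5 \end{bmatrix} = 14$$

and thus the 95 percent confidence interval for $E(Y_f)$ is

$$14 \pm 4.303\sqrt{0.75}\,\sqrt{6.7}$$

or 4.36 to 23.64

as before.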
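As a check on the deviation-form route, the sketch below (Python with NumPy, an assumption of this illustration) evaluates $1/n + x_f(X_2'AX_2)^{-1}x_f'$ from the second-order matrix quoted above and confirms that it equals the value 6.7 obtained from the third-order inverse. The slope vector used for the point forecast is the one implied by the example; all variable names are illustrative.

```python
import numpy as np

n = 5
y_bar, s2, t_crit = 4.0, 0.75, 4.303

# Forecast-period deviations: X_2f - mean(X_2) = 10 - 3, X_3f - mean(X_3) = 10 - 5.
x_f = np.array([7.0, 5.0])

# (X_2' A X_2)^{-1}, the inverse of the 2 x 2 matrix of sums of squares
# and products in deviation form, as quoted in the text.
inv_dev = np.array([[1.0, -1.5],
                    [-1.5, 2.5]])

b2 = np.array([2.5, -1.5])        # OLS slopes used in the point forecast

y_hat = y_bar + x_f @ b2                        # point forecast
var_factor = 1.0 / n + x_f @ inv_dev @ x_f      # 1/n + x_f (X_2'AX_2)^{-1} x_f'
half_width = t_crit * np.sqrt(s2 * var_factor)  # half-width of the interval for E(Y_f)

print(y_hat, var_factor, y_hat - half_width, y_hat + half_width)
```

Only a 2 x 2 matrix has to be inverted here, which is the practical advantage claimed for this formulation.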


Prediction when the X Variables Are Uncertain 


The treatment of interval forecasts given above assumes implicitly that the values 
of the X variables in the forecast period are known with certainty. In practice, 
however, it is more realistic to postulate some uncertainty about the X values. Let 


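$$x' = [1 \quad X_{2f} \quad X_{3f} \quad \cdots \quad X_{kf}]$$

indicate the true values of the explanatory variables in the forecast period and

$$\hat{x}' = [1 \quad \hat{X}_{2f} \quad \hat{X}_{3f} \quad \cdots \quad \hat{X}_{kf}]$$

the values that the forecaster thinks will be obtained in the forecast period. The
true value $Y_f$ is given by

$$Y_f = x'\beta + u_f$$

and the point prediction will now be

$$\hat{Y}_f = \hat{x}'b$$

Thus the forecast error is

$$e_f = Y_f - \hat{Y}_f = u_f + x'\beta - \hat{x}'b$$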


For simplicity we will drop the f subscript, since there is no ambiguity, and write 
the forecast error as 


$$e = u - \hat{x}'(b - \beta) - (\hat{x} - x)'\beta = u - \hat{x}'(b - \beta) - \beta'(\hat{x} - x) \tag{5-81}$$

If we assume that the forecaster makes unbiased forecasts of the X values, that is,

$$E(\hat{x}) = x$$

and, in addition, that there is zero covariance in the population between forecasts
of x and the estimate of $\beta$ from the sample data, then

$$E\{\hat{x}'(b - \beta)\} = 0$$

and so

$$E(e) = 0$$

Hence the variance of the forecast error is found by squaring Eq. (5-81) and
taking expectations, to give

$$\sigma_f^2 = E\{u^2 + \hat{x}'(b - \beta)(b - \beta)'\hat{x} + \beta'(\hat{x} - x)(\hat{x} - x)'\beta + \text{cross-product terms}\} \tag{5-82}$$

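On taking expectations the cross-product terms vanish because of the independence
of u, $\hat{x}$, and b. For the remaining terms

$$E\{\beta'(\hat{x} - x)(\hat{x} - x)'\beta\} = \beta'E\{(\hat{x} - x)(\hat{x} - x)'\}\beta = \beta'\operatorname{var}(\hat{x})\beta$$

and

$$E\{\hat{x}'(b - \beta)(b - \beta)'\hat{x}\} = E(\operatorname{tr}[\hat{x}'(b - \beta)(b - \beta)'\hat{x}]) = E(\operatorname{tr}[(b - \beta)(b - \beta)'\hat{x}\hat{x}']) = \operatorname{tr}[E\{(b - \beta)(b - \beta)'\}E(\hat{x}\hat{x}')]$$

Now

$$\operatorname{var}(\hat{x}) = E\{(\hat{x} - x)(\hat{x} - x)'\} = E(\hat{x}\hat{x}') - xx'$$

and so

$$E(\hat{x}\hat{x}') = \operatorname{var}(\hat{x}) + xx'$$

Thus

$$E\{\hat{x}'(b - \beta)(b - \beta)'\hat{x}\} = \operatorname{tr}[\operatorname{var}(b) \cdot (\operatorname{var}(\hat{x}) + xx')]$$

and

$$\operatorname{tr}[\operatorname{var}(b)xx'] = \operatorname{tr}[x'\operatorname{var}(b)x] = x'\operatorname{var}(b)x$$

Thus substituting these expressions in Eq. (5-82)

$$\sigma_f^2 = \sigma_u^2 + x'\operatorname{var}(b)x + \beta'\operatorname{var}(\hat{x})\beta + \operatorname{tr}[\operatorname{var}(b) \cdot \operatorname{var}(\hat{x})] \tag{5-83}$$

If there were no uncertainty about the X values in the forecast period, this
expression reduces to

$$\sigma_u^2 + x'\operatorname{var}(b)x$$

which is the conventional formula for the variance of a forecast. To implement
Eq. (5-83) the various unknowns are replaced by estimated values: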


1. $\sigma_u^2$ is replaced by $s^2 = e'e/(n - k)$ in the usual fashion.
2. x is replaced by $\hat{x}$.
3. var(b) is estimated by the usual OLS program.
4. $\beta$ is replaced by the estimated b.

The main practical difficulty is likely to be in the estimation of $\operatorname{var}(\hat{x})$, the
variance-covariance matrix of the forecast X values. There may be accumulated
experience in forecasting from which variances and covariances may be estimated.
Alternatively, the forecaster may have subjective assessments that a forecast value
is very likely to be within, say, 5 percent of the true value, which in turn implies a
figure for the variance.

The remaining practical difficulty about the use of Eq. (5-83) is that we can
no longer determine exact confidence intervals using the t and normal distributions.
The reason is that even if normality is assumed for $\hat{x}$ as well as u, the
forecast error in Eq. (5-81) is not normally distributed since it involves $\hat{x}'(b - \beta)$,
which is the sum of products of normal variables. One may follow the suggestion
of Feldstein to use the Chebyshev inequality to determine an outer-bound
forecast interval.† The practical procedure is as follows. Letting $s_f$ denote the
square root of the estimated value of Eq. (5-83) we can state:

The probability that the observed value of Y in the forecast period will fall
outside the interval $\hat{Y}_f \pm cs_f$ does not exceed $1/c^2$.

The researcher can set the value of c to make $1/c^2$ equal to 0.05 or whatever
is desired. The Chebyshev inequality strictly involves the true $\sigma_f$, but it is a very
conservative statement and unlikely to be seriously affected by the replacement of
$\sigma_f$ by $s_f$. If the distribution of $Y_f$ were sufficiently well behaved to be unimodal
and symmetric, the probability of $Y_f$ lying outside the interval $\hat{Y}_f \pm cs_f$ would not
exceed $4/9c^2$.
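To make the implementation of Eq. (5-83) concrete, here is a hedged sketch in Python with NumPy (an assumption of this illustration). The function names `forecast_error_variance` and `chebyshev_multiplier` are hypothetical, and the matrix `var_x` of forecast variances for the X values is invented purely for illustration; the remaining inputs reuse the numbers of the earlier worked example.

```python
import numpy as np

def forecast_error_variance(sigma2_u, x, var_b, beta, var_x):
    """Eq. (5-83): sigma_u^2 + x'var(b)x + beta'var(x_hat)beta + tr[var(b) var(x_hat)]."""
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    var_b = np.asarray(var_b, dtype=float)
    var_x = np.asarray(var_x, dtype=float)
    return float(sigma2_u + x @ var_b @ x + beta @ var_x @ beta
                 + np.trace(var_b @ var_x))

def chebyshev_multiplier(alpha):
    """Value of c for which the Chebyshev bound 1/c^2 equals alpha."""
    return 1.0 / np.sqrt(alpha)

# Inputs carried over from the earlier example (estimates replace unknowns).
s2 = 0.75
x_hat = np.array([1.0, 10.0, 10.0])
XtX_inv = np.array([[26.7, 4.5, -8.0],
                    [4.5, 1.0, -1.5],
                    [-8.0, -1.5, 2.5]])
var_b = s2 * XtX_inv
b = np.array([4.0, 2.5, -1.5])

# HYPOTHETICAL var(x_hat): the constant is known exactly; the two X forecasts
# are given variance 0.25 each and zero covariance, for illustration only.
var_x = np.diag([0.0, 0.25, 0.25])

var_certain = forecast_error_variance(s2, x_hat, var_b, b, np.zeros((3, 3)))
var_uncertain = forecast_error_variance(s2, x_hat, var_b, b, var_x)

c = chebyshev_multiplier(0.05)            # outer-bound interval with 1/c^2 = 0.05
half_width = c * np.sqrt(var_uncertain)

print(var_certain, var_uncertain, c, half_width)
```

With `var_x` set to zero the function returns the conventional forecast variance, so the two extra terms measure what uncertainty about the X values adds. The Chebyshev multiplier is much larger than the corresponding normal or t value, which reflects how conservative the bound is.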


PROBLEMS 


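5-1 Test the hypotheses (N.B. plural) $\beta_1 = 1$, $\beta_2 = 1$, $\beta_3 = -2$ in the regression model

$$Y_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + \beta_3 X_{3t} + u_t$$

given the following sums of squares and products of deviations from means for 24 observations:

$$\sum y^2 = 60 \qquad \sum x_1^2 = 10 \qquad \sum x_2^2 = 30 \qquad \sum x_3^2 = 20$$
$$\sum yx_1 = 7 \qquad \sum yx_2 = -7 \qquad \sum yx_3 = -26$$
$$\sum x_1x_2 = 10 \qquad \sum x_1x_3 = 5 \qquad \sum x_2x_3 = 15$$

Test also the hypothesis that $\beta_1 + \beta_2 + \beta_3 = 0$. How does this differ from the hypothesis that
$[\beta_1 \ \ \beta_2 \ \ \beta_3] = [1 \ \ 1 \ \ {-2}]$? Test the latter hypothesis.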
5-2 The following sums were obtained from 10 sets of observations on Y, $X_1$, and $X_2$:

$$\sum Y = 20 \qquad \sum X_1 = 30 \qquad \sum X_2 = 40$$
$$\sum Y^2 = 88.2 \qquad \sum X_1^2 = 92 \qquad \sum X_2^2 = 163$$
$$\sum YX_1 = 59 \qquad \sum YX_2 = 88 \qquad \sum X_1X_2 = 119$$

Estimate the regression of Y on $X_1$ and $X_2$, and test the hypothesis that the coefficient of $X_2$ is zero.
†M. S. Feldstein, “The Error of Forecast in Econometric Models when the Forecast-Period
Exogenous Variables are Stochastic,” Econometrica, 39, 1971, pp. 55-60.

5-3 Let

y be an $n \times 1$ column vector
X an $n \times k$ matrix
X[i] be X with the ith column ($x_i$) removed
$e_{yi}$ be the residual vector from the regression of y on X[i]
$e_i$ be the residual vector from the regression of $x_i$ on X[i]

Now consider the two regressions:

1. $e_{yi}$ on $e_i$
2. y on X

Prove that:
(a) The slope $b = e_{yi}'e_i/e_i'e_i$ from regression 1 and the multiple regression coefficient $b_i$ from
regression 2 are identical.
(b) The residuals from the two regressions are identical.
(c) The simple correlation between $e_{yi}$ and $e_i$ is the same as the partial correlation between y and
$x_i$ in regression 2.
5-4 The following regression equation is estimated as a production function for Q: 


$$\log Q = 1.37 + \underset{(0.257)}{0.632} \log K + \underset{(0.219)}{0.452} \log L$$

$$R^2 = 0.98 \qquad \operatorname{cov}(b_K, b_L) = 0.055$$

and the standard errors are given in parentheses.

Test the following null hypotheses:

(a) The capital K and labor L elasticities of output are identical.

(b) There are constant returns to scale.

(University of Washington, 1980) 

Note: The problem does not give the number of sample observations. Does this omission affect 
your conclusions? 
5-5 Consider a multiple regression model for which all classical assumptions hold, but in which there 
is no constant term. Suppose you wish to test the null hypothesis that there is no relationship between y 
and X, that is, 

$$H_0\colon \beta_1 = \cdots = \beta_k = 0$$

against the alternative that at least one of the $\beta$'s is nonzero. Present the appropriate test statistic and
state its distribution (including the appropriate number(s) of degrees of freedom).
(University of Michigan, 1978) 


5-6 One aspect of the rational expectations hypothesis involves the claim that expectations are 
unbiased, that is, that the average prediction is equal to the observed realization of the variable under 
investigation. This claim can be tested by reference to announced predictions and to actual values of 
the rate of interest on three-month U.S. Treasury Bills published in The Goldsmith-Nagan Bond and 
Money Market Letter. The results of least-squares estimation (based on 30 quarterly observations) of 


the regression of the actual on the predicted interest rates were as follows:

$$r_t = \underset{(0.86)}{0.24} + \underset{(0.14)}{0.94}\, r_t^* + e_t, \qquad \text{RSS} = 28.56$$

where $r_t$ is the observed interest rate, and $r_t^*$ is the average expectation of $r_t$ held at the end of the
preceding quarter. Figures in parentheses are estimated standard errors. The sample data on $r^*$ give

$$\sum r_t^*/30 = 10, \qquad \sum (r_t^* - \bar{r}^*)^2 = 52$$

Carry out the test, assuming that all basic assumptions of the classical regression model are satisfied.
(University of Michigan, 1981)


5-7 Consider the following regression model in deviation form: 
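$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + u_t$$

with sample data:

$$n = 100, \qquad \sum y^2 = \ldots, \qquad \sum x_1^2 = 30, \qquad \sum x_2^2 = 3,$$
$$\sum x_1 y = 30, \qquad \sum x_2 y = 20, \qquad \sum x_1 x_2 = 0$$

(a) Compute the OLS estimates of $\beta_1$, $\beta_2$, and $R^2$.
(b) Test the hypothesis $H_0\colon \beta_2 = 7$ against $H_1\colon \beta_2 \neq 7$.
(c) Test the hypothesis $H_0\colon \beta_1 = \beta_2 = 0$ against $H_1\colon \beta_1 \neq 0$ or $\beta_2 \neq 0$.
(d) Test the hypothesis $H_0\colon \beta_2 = 7\beta_1$ against $H_1\colon \beta_2 \neq 7\beta_1$.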
(UL, 1981) 


5-8 Given the following least-squares estimates: 
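$$C_t = \text{constant} + 0.92\,Y_t + e_{1t}$$
$$C_t = \text{constant} + 0.84\,C_{t-1} + e_{2t}$$
$$C_{t-1} = \text{constant} + 0.78\,Y_t + e_{3t}$$
$$Y_t = \text{constant} + 0.55\,C_{t-1} + e_{4t}$$

calculate the least-squares estimates of $\beta_2$ and $\beta_3$ in

$$C_t = \beta_1 + \beta_2 Y_t + \beta_3 C_{t-1} + u_t$$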


(University of Michigan, 1981) 
5-9 Prove that $R^2$ is the square of the simple correlation between y and $\hat{y}$, where $\hat{y} = X(X'X)^{-1}X'y$.
5-10 Prove that if a regression is fitted without a constant term, the residuals will not necessarily sum
to zero, and $R^2$, if calculated as $1 - e'e/(y'y - n\bar{Y}^2)$, may be negative.
5-11 A researcher wishes to estimate the regression of y on X without an intercept term, that is, X 
does not contain a column of 1s. Unfortunately, the regression program at hand automatically 
computes an intercept term. Douglas M. Hawkins suggests that the program can be “tricked” into 
estimating the correct intercept free regression by entering each data point twice—once in its correct 
form (y,,x,) and once with the opposite sign (—y,, —x,). 
Prove that: 
(a) The “trick” regression and the correct regression (with intercept suppressed) yield the same 
coefficients for X. 
(b) The residual sum of squares from the “trick” regression is exactly double the value from the 
correct regression. 
Compute the ratio of the standard errors of the two regressions. 
(American Statistician, 34, Nov. 1980, p. 233) 
5-12 (a) Prove that $\bar{R}^2$ increases with the addition of an extra explanatory variable only if the F
($= t^2$) statistic for that variable exceeds unity.
(b) Prove that the partial

$$r^2 = \frac{F}{F + \text{df}} = \frac{t^2}{t^2 + \text{df}}$$

where t is the value of the statistic for testing the significance of the coefficient of the $X_i$ to which the
partial r is related, and df is the number of degrees of freedom in the regression.
5-13 Let the regression equation be partitioned as 


y=X,B, + XB, +e 


Let b, and b, be the usual least-squares estimators. Suppose that E(e) = X,y, that is, the mean vector 
of the disturbances is a linear combination of some of the regressors. Prove that b, is biased but b, is
unbiased. 

(University of Michigan, 1981)
5-14 Suppose that the $m \times 1$ vector $x_i$ denotes m observations on the ith individual ($i = 1, \ldots, p$)
and $\tilde{x}_i$ is the corresponding vector of deviations from the ith sample mean. Let the $x_i$ and $\tilde{x}_i$ vectors
be “stacked” to give $mp \times 1$ vectors

$$x' = [x_1' \quad x_2' \quad \cdots \quad x_p']$$

and

$$\tilde{x}' = [\tilde{x}_1' \quad \tilde{x}_2' \quad \cdots \quad \tilde{x}_p']$$


Find a matrix D such that Dx 


CHAPTER 
SIX 

FURTHER TOPICS IN THE 
k-VARIABLE LINEAR MODEL 


6-1 ESTIMATION SUBJECT TO LINEAR RESTRICTIONS 


In Chap. 5 we have described the procedure for testing the hypothesis that the
elements of the population vector $\beta$ obey the set of q (< k) linear restrictions
embodied in the relations

$$H_0\colon R\beta = r$$

If $H_0$ is not rejected, one may wish to reestimate the model, incorporating the
restrictions in the estimation process. One important reason for such reestimation
is that it will improve the efficiency of the estimates. This produces an estimator
$b_*$ which then satisfies

$$Rb_* = r \tag{6-1}$$

For example, if the hypothesis of constant returns to scale is not rejected for a
production function, the reestimation process would yield a production function
with estimated elasticities which sum to unity.

We must first of all show how to derive an estimator $b_*$ which satisfies Eq.
(6-1). Second, we will use this estimator to cast new light on some of the test
procedures of Chap. 5, and third, we will look at some important applications of
the new estimator.

The assumed model, as before, is

$$y = X\beta + u$$

We define the scalar function

$$\phi = (y - Xb_*)'(y - Xb_*) - 2\lambda'(Rb_* - r) \tag{6-2}$$

where $\lambda$ denotes a column vector of q Lagrange multipliers. Taking the partial
derivatives of $\phi$ gives

$$\frac{\partial\phi}{\partial b_*} = -2X'y + 2X'Xb_* - 2R'\lambda$$

and

$$\frac{\partial\phi}{\partial\lambda} = -2(Rb_* - r)$$

Setting these partial derivatives to zero gives the equations to be solved for $b_*$ and
$\lambda$, namely,†

$$X'Xb_* - X'y - R'\lambda = 0 \tag{6-3}$$

$$Rb_* - r = 0 \tag{6-4}$$

Premultiplying Eq. (6-3) by $R(X'X)^{-1}$ gives

$$Rb_* - R(X'X)^{-1}X'y - R(X'X)^{-1}R'\lambda = 0$$

Using Eq. (6-4) and resurrecting the OLS estimator of Chap. 5, that is,

$$b = (X'X)^{-1}X'y$$

this equation may be solved for $\lambda$ as

$$\lambda = [R(X'X)^{-1}R']^{-1}(r - Rb)$$

Substituting back in Eq. (6-3) gives

$$b_* = (X'X)^{-1}X'y + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - Rb)$$

that is,

$$b_* = b + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - Rb) \tag{6-5}$$
where b is the unrestricted OLS estimator $(X'X)^{-1}X'y$. Formula (6-5) defines the
restricted least-squares estimator satisfying the set of q restrictions embodied in
$Rb_* = r$.‡ Corresponding to $b_*$, we may define the residual vector

$$e_* = y - Xb_*$$


† To keep the notation as simple as possible, we have not distinguished between the vectors $b_*$ and
$\lambda$ which appear in the objective function, Eq. (6-2), and the specific vectors that emerge as the
solutions to Eqs. (6-3) and (6-4).
‡ Provided the restrictions $R\beta = r$ are true, the variance-covariance matrix of the restricted
least-squares estimator may be shown to be

$$\operatorname{var}(b_*) = \sigma^2\{(X'X)^{-1} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}\}$$

See Problem 6-6. We should also note that in some problems it may be simpler to obtain $b_*$ by
imposing the restrictions directly on the problem rather than by substituting in Eq. (6-5). For example,
suppose the data are already in deviation form and we wish to estimate

$$y = \beta_2 x_2 + \beta_3 x_3 + u$$

which may be written

$$e_* = y - Xb - X(b_* - b) = e - X(b_* - b)$$

where e is the OLS residual vector. Transposing and multiplying

$$e_*'e_* = e'e + (b_* - b)'X'X(b_* - b)$$

the cross-product term vanishing since $X'e = 0$. Thus the difference between the
restricted and the unrestricted residual sums of squares may be written

$$e_*'e_* - e'e = (b_* - b)'X'X(b_* - b) \tag{6-6}$$

Substituting for $b_* - b$ from Eq. (6-5) and simplifying gives†

$$e_*'e_* - e'e = (r - Rb)'[R(X'X)^{-1}R']^{-1}(r - Rb) \tag{6-7}$$

The right-hand side of Eq. (6-7) is exactly the expression in the numerator of the
F statistic for testing $H_0\colon R\beta = r$ derived in Eq. (5-68). Thus an alternative
expression of the test statistic for $H_0\colon R\beta = r$ is

$$F = \frac{(e_*'e_* - e'e)/q}{e'e/(n - k)} \tag{6-8}$$

where $e_*'e_*$ denotes the restricted residual sum of squares derived from the vector
$b_*$, which satisfies the q restrictions $Rb_* = r$, and $e'e$ denotes the unrestricted
residual sum of squares from the usual OLS regression. We have already derived
this result for one particular application in Eq. (5-76), but the derivation leading
up to Eq. (6-8) is perfectly general and applies to all cases.
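The restricted estimator of Eq. (6-5) and the identity in Eq. (6-7) can be checked numerically. The sketch below uses Python with NumPy (an assumption of this illustration); the function name `restricted_ls`, the simulated data, and the single restriction that two slopes sum to unity are all hypothetical and serve only to exercise the formulas.

```python
import numpy as np

def restricted_ls(X, y, R, r):
    """Return OLS b, restricted b_* from Eq. (6-5), and [R (X'X)^{-1} R']^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    middle = np.linalg.inv(R @ XtX_inv @ R.T)
    b_star = b + XtX_inv @ R.T @ middle @ (r - R @ b)
    return b, b_star, middle

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.6, 0.4]) + rng.normal(scale=0.5, size=n)

R = np.array([[0.0, 1.0, 1.0]])   # one restriction: the two slopes sum to unity
r = np.array([1.0])
q = R.shape[0]

b, b_star, middle = restricted_ls(X, y, R, r)
e = y - X @ b                     # unrestricted residuals
e_star = y - X @ b_star           # restricted residuals

lhs = e_star @ e_star - e @ e                 # left-hand side of Eq. (6-7)
rhs = (r - R @ b) @ middle @ (r - R @ b)      # right-hand side of Eq. (6-7)
F = (lhs / q) / (e @ e / (n - k))             # Eq. (6-8)

print(R @ b_star, lhs, rhs, F)
```

The printed restricted coefficients satisfy the restriction exactly (up to rounding), and the two sides of Eq. (6-7) agree, so the F statistic can be computed from either the residual sums of squares or the quadratic form.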

To summarize, the test of the hypothesis that the elements of $\beta$ obey a set of q
(< k) linear restrictions embodied in

$$H_0\colon R\beta = r$$

may be carried out by computing the unrestricted OLS vector b and the residual
vector e and then calculating the F statistic, Eq. (5-68),

$$F = \frac{(r - Rb)'[R(X'X)^{-1}R']^{-1}(r - Rb)/q}{e'e/(n - k)}$$

rejecting $H_0$ if F exceeds a preselected critical value taken from the F distribution
with q, n − k degrees of freedom. Alternatively one may compute the restricted
vector $b_*$ from Eq. (6-5) or otherwise, and the corresponding residual vector
subject to the restriction $\beta_2 + \beta_3 = 1$. Substituting the restriction in the equation gives

$$y = \beta_2(x_2 - x_3) + x_3 + u$$

so that the regression of $y - x_3$ on $x_2 - x_3$ yields an estimate $b_{*2}$, and $b_{*3}$ is then obtained from
$b_{*3} = 1 - b_{*2}$. For the general version of this approach, see Problem 6-11.

† As shown in the footnote on page 184, $[R(X'X)^{-1}R']^{-1}$ is positive definite. Thus $e_*'e_* - e'e \geq 0$,
with equality only when $Rb = r$. Imposing restrictions cannot lower the residual sum of squares.

$e_* = y - Xb_*$. The test statistic is then

$$F = \frac{(b_* - b)'X'X(b_* - b)/q}{e'e/(n - k)} \tag{6-9}$$

or equivalently

$$F = \frac{(e_*'e_* - e'e)/q}{e'e/(n - k)}$$

One of the most useful applications of these formulas is in tests of structural
change.


6-2 TESTS OF STRUCTURAL CHANGE 


Example 6-1 Suppose we have data on two variables, 
Y = consumption expenditure 
X = disposable income 


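The data cover two distinct subperiods, $n_1$ observations relating to wartime
years and $n_2$ observations relating to peacetime years. Suppose we wish to
investigate whether there is any change, or shift, in the consumption function
between the wartime and peacetime periods. Such a change is referred to as a
structural change or structural break. Let us denote the consumption functions by

$$Y = \alpha_1 + \beta_1 X + u \qquad \text{wartime function} \tag{6-10a}$$

$$Y = \alpha_2 + \beta_2 X + u \qquad \text{peacetime function} \tag{6-10b}$$

This is the unrestricted form of the model, allowing intercepts and slopes to
be different in the two periods. This model would be set up in matrix form as
follows:

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{n_1} \\ Y_{n_1+1} \\ Y_{n_1+2} \\ \vdots \\ Y_{n_1+n_2} \end{bmatrix} = \begin{bmatrix} 1 & X_1 & 0 & 0 \\ 1 & X_2 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & X_{n_1} & 0 & 0 \\ 0 & 0 & 1 & X_{n_1+1} \\ 0 & 0 & 1 & X_{n_1+2} \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 1 & X_{n_1+n_2} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \beta_1 \\ \alpha_2 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n_1} \\ u_{n_1+1} \\ u_{n_1+2} \\ \vdots \\ u_{n_1+n_2} \end{bmatrix} \tag{6-11}$$

where the wartime observations have been listed first and the peacetime
observations last. More compactly Eq. (6-11) becomes

$$y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \beta_1 \\ \alpha_2 \\ \beta_2 \end{bmatrix} + u = X\beta + u \tag{6-12}$$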

2 


where the data matrix X is block-diagonal.† Notice that each of the submatrices
$X_1$ and $X_2$ has a column of units in the first position followed by a
column of observations on income, and $\beta$ indicates a column vector of the
four structural parameters. Applying OLS to Eq. (6-12) gives

$$b = \begin{bmatrix} a_1 \\ b_1 \\ a_2 \\ b_2 \end{bmatrix} = (X'X)^{-1}X'y = \begin{bmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & (X_2'X_2)^{-1} \end{bmatrix} \begin{bmatrix} X_1'y_1 \\ X_2'y_2 \end{bmatrix} = \begin{bmatrix} (X_1'X_1)^{-1}X_1'y_1 \\ (X_2'X_2)^{-1}X_2'y_2 \end{bmatrix} \tag{6-13}$$

These estimates are seen to be identical with those obtained by applying OLS
separately to Eqs. (6-10a) and (6-10b). One merely sets the data up in the
form of Eq. (6-12) and a single regression will produce all four regression
parameters. Using Eq. (6-13) one can then obtain the vector e of $n_1 + n_2$
residuals, and $e'e$ gives the unrestricted residual sum of squares.

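Now set up the null hypothesis of no structural change. This may be
formulated as

$$H_0\colon \begin{bmatrix} \alpha_1 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \alpha_2 \\ \beta_2 \end{bmatrix} \tag{6-14}$$

or, putting it in the $R\beta = r$ framework,

$$\begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \beta_1 \\ \alpha_2 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

so that

$$R = [\,I \quad -I\,] \quad \text{and} \quad r = 0 \tag{6-15}$$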

† When using computers the student must take care to understand the properties of the program
being used. If the program automatically estimates an intercept, feeding in the block-diagonal X
matrix would produce a linear dependence between the column of units supplied by the computer and
the first and third columns of X. Thus one must either feed in X as it stands and suppress the
automatic intercept, or else allow the automatic intercept and modify the X matrix in a way to be
discussed later in this section.

where I is a unit matrix of order 2. The restricted model may thus be
formulated as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + u \tag{6-16}$$

The contrast with the unrestricted model in Eq. (6-12) is that the $X_i$ matrices
are now stacked vertically, so that only two parameters are required to
describe the relation.

We now have two alternative procedures for testing $H_0$. Using the
unrestricted b computed from Eq. (6-13) and the R matrix and r vector
defined in Eq. (6-15), we can calculate the F statistic defined in Eq. (5-68).
Alternatively, we may compute the $e_*$ vector that comes from fitting the
restricted model (6-16) by OLS and substitute either in Eq. (6-8) or in Eq.
(6-9). The second procedure is the simpler, but we will illustrate both with the
following data. Again these are artificially simple numbers, designed to keep
the arithmetic to the minimum and highlight the methods.


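Wartime data

$$y_1 = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \\ 6 \end{bmatrix} \qquad X_1 = \begin{bmatrix} 1 & 2 \\ 1 & 4 \\ 1 & 6 \\ 1 & 10 \\ 1 & 13 \end{bmatrix}$$

Peacetime data

$$y_2 = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \\ 6 \\ 6 \\ 7 \\ 9 \\ 9 \\ 11 \end{bmatrix} \qquad X_2 = \begin{bmatrix} 1 & 2 \\ 1 & 4 \\ 1 & 6 \\ 1 & 8 \\ 1 & 10 \\ 1 & 12 \\ 1 & 14 \\ 1 & 16 \\ 1 & 18 \\ 1 & 20 \end{bmatrix}$$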

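The sample sizes are

$$n_1 = 5 \qquad n_2 = 10 \qquad n = 15$$

and we have

$$X_1'X_1 = \begin{bmatrix} 5 & 35 \\ 35 & 325 \end{bmatrix} \qquad X_2'X_2 = \begin{bmatrix} 10 & 110 \\ 110 & 1540 \end{bmatrix}$$

$$(X_1'X_1)^{-1} = \frac{1}{400}\begin{bmatrix} 325 & -35 \\ -35 & 5 \end{bmatrix} \qquad (X_2'X_2)^{-1} = \frac{1}{3300}\begin{bmatrix} 1540 & -110 \\ -110 & 10 \end{bmatrix}$$

$$X_1'y_1 = \begin{bmatrix} 15 \\ 140 \end{bmatrix} \qquad X_2'y_2 = \begin{bmatrix} 60 \\ 828 \end{bmatrix}$$

$$y_1'y_1 = 61 \qquad y_2'y_2 = 448$$

Substitution in Eq. (6-13) gives the unrestricted OLS estimator

$$\begin{bmatrix} a_1 \\ b_1 \\ a_2 \\ b_2 \end{bmatrix} = \begin{bmatrix} (X_1'X_1)^{-1}X_1'y_1 \\ (X_2'X_2)^{-1}X_2'y_2 \end{bmatrix} = \begin{bmatrix} -0.062500 \\ 0.437500 \\ 0.400000 \\ 0.509091 \end{bmatrix}$$

Thus the estimated regressions are

$$\hat{Y} = -0.0625 + 0.4375X \qquad \text{wartime}$$

and

$$\hat{Y} = 0.4000 + 0.5091X \qquad \text{peacetime}$$

These point estimates give the wartime function a smaller intercept and lower
slope than the peacetime function. The residual sum of squares from the
wartime regression is

$$e_1'e_1 = y_1'y_1 - \mathbf{b}_1'X_1'y_1 = 61 - [-0.0625 \quad 0.4375]\begin{bmatrix} 15 \\ 140 \end{bmatrix} = 61 - 60.3125 = 0.6875$$

where $\mathbf{b}_1 = [a_1 \quad b_1]'$. Similarly for the peacetime regression, with $\mathbf{b}_2 = [a_2 \quad b_2]'$,

$$e_2'e_2 = y_2'y_2 - \mathbf{b}_2'X_2'y_2 = 448 - [0.4 \quad 0.509091]\begin{bmatrix} 60 \\ 828 \end{bmatrix} = 448 - 445.5273 = 2.4727$$

Thus the unrestricted residual sum of squares is

$$e'e = e_1'e_1 + e_2'e_2 = 3.1602$$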
In fitting the restricted model, Eq. (6-16), the data matrix is now

$$X_* = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

giving

$$X_*'X_* = X_1'X_1 + X_2'X_2 \qquad X_*'y = X_1'y_1 + X_2'y_2$$

From the data,

$$X_*'X_* = \begin{bmatrix} 15 & 145 \\ 145 & 1865 \end{bmatrix} \quad \text{and} \quad X_*'y = \begin{bmatrix} 75 \\ 968 \end{bmatrix}$$

The restricted coefficient vector is

$$b_* = \frac{1}{6950}\begin{bmatrix} 1865 & -145 \\ -145 & 15 \end{bmatrix}\begin{bmatrix} 75 \\ 968 \end{bmatrix} = \begin{bmatrix} -0.069784 \\ 0.524460 \end{bmatrix}$$

Thus the estimated common regression is

$$\hat{Y} = -0.0698 + 0.5245X$$

and the restricted residual sum of squares is

$$e_*'e_* = (61 + 448) - [-0.069784 \quad 0.524460]\begin{bmatrix} 75 \\ 968 \end{bmatrix} = 509 - 502.4435 = 6.5565$$

Thus substitution in Eq. (6-8) gives

$$F = \frac{(6.5565 - 3.1602)/2}{3.1602/(15 - 4)} = 5.91$$

Notice that the number of degrees of freedom in the numerator is 2, since
there are two restrictions embodied in $H_0$, and in the denominator n − k =
15 − 4 since four regression coefficients are estimated in the unrestricted
regression. From the tables of the F distribution,

$$F_{0.05}(2, 11) = 3.98 \quad \text{and} \quad F_{0.01}(2, 11) = 7.21$$

Thus the hypothesis of no structural change would be rejected at the 5
percent level of significance, but not at the 1 percent level.
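The residual-sum-of-squares route just described can be reproduced with a few lines of code. The following is a minimal sketch in Python with NumPy (an assumption of this illustration); the helper `rss` is a hypothetical name, and the data are the wartime and peacetime observations listed above.

```python
import numpy as np

def rss(X, y):
    """Fit y on X by least squares; return coefficients and residual sum of squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return b, float(e @ e)

# Wartime (5 observations) and peacetime (10 observations) data from the example.
y1 = np.array([1, 2, 2, 4, 6], dtype=float)
x1 = np.array([2, 4, 6, 10, 13], dtype=float)
y2 = np.array([1, 3, 3, 5, 6, 6, 7, 9, 9, 11], dtype=float)
x2 = np.arange(2, 21, 2, dtype=float)          # 2, 4, ..., 20

X1 = np.column_stack([np.ones_like(x1), x1])
X2 = np.column_stack([np.ones_like(x2), x2])

# Unrestricted model: separate regressions, residual sums of squares added.
b1, rss1 = rss(X1, y1)
b2, rss2 = rss(X2, y2)
rss_u = rss1 + rss2

# Restricted model, Eq. (6-16): stack the two samples and fit one regression.
b_star, rss_r = rss(np.vstack([X1, X2]), np.concatenate([y1, y2]))

n, k, q = 15, 4, 2
F = ((rss_r - rss_u) / q) / (rss_u / (n - k))  # Eq. (6-8)

print(b1, b2, b_star)
print(rss_u, rss_r, F)
```

Computed at full precision the restricted sum of squares differs from the hand calculation in the fourth decimal place, which is the kind of rounding discrepancy to expect when intermediate results are carried to six figures.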

The alternative approach is to calculate only the unrestricted b vector
and substitute directly in Eq. (5-68). This requires the evaluation of
$R(X'X)^{-1}R'$. From Eqs. (6-12) and (6-15),

$$R(X'X)^{-1}R' = [\,I \quad -I\,]\begin{bmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & (X_2'X_2)^{-1} \end{bmatrix}\begin{bmatrix} I \\ -I \end{bmatrix} = (X_1'X_1)^{-1} + (X_2'X_2)^{-1} = \begin{bmatrix} 1.27916667 & -0.12083333 \\ -0.12083333 & 0.01553030 \end{bmatrix}$$

with

$$[R(X'X)^{-1}R']^{-1} = \begin{bmatrix} 2.9496 & 22.9496 \\ 22.9496 & 242.9497 \end{bmatrix}$$

$$Rb = [\,I \quad -I\,]\begin{bmatrix} -0.062500 \\ 0.437500 \\ 0.400000 \\ 0.509091 \end{bmatrix} = \begin{bmatrix} -0.462500 \\ -0.071591 \end{bmatrix}$$

and r = 0. Thus

$$(r - Rb)'[R(X'X)^{-1}R']^{-1}(r - Rb) = 3.3969$$

and from the previous calculations,

$$e_*'e_* - e'e = 6.5565 - 3.1602 = 3.3963$$

so that the two numbers agree, subject to rounding errors in the calculation.
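The agreement between the two routes can also be checked at full machine precision. The sketch below (Python with NumPy, an assumption of this illustration) builds the block-diagonal X of Eq. (6-12), forms $R = [I \ \ {-I}]$, and evaluates the quadratic form in Eq. (5-68) directly; variable names are illustrative.

```python
import numpy as np

y1 = np.array([1, 2, 2, 4, 6], dtype=float)
x1 = np.array([2, 4, 6, 10, 13], dtype=float)
y2 = np.array([1, 3, 3, 5, 6, 6, 7, 9, 9, 11], dtype=float)
x2 = np.arange(2, 21, 2, dtype=float)

X1 = np.column_stack([np.ones_like(x1), x1])
X2 = np.column_stack([np.ones_like(x2), x2])

# Block-diagonal data matrix of Eq. (6-12); no automatic intercept is added.
X = np.block([[X1, np.zeros_like(X1)],
              [np.zeros_like(X2), X2]])
y = np.concatenate([y1, y2])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y              # [a1, b1, a2, b2], as in Eq. (6-13)
e = y - X @ b

R = np.hstack([np.eye(2), -np.eye(2)])   # Eq. (6-15)
r = np.zeros(2)
d = r - R @ b

# (r - Rb)' [R (X'X)^{-1} R']^{-1} (r - Rb), numerator of Eq. (5-68) before dividing by q.
quad = d @ np.linalg.solve(R @ XtX_inv @ R.T, d)

n, k = X.shape
q = R.shape[0]
F = (quad / q) / (e @ e / (n - k))

print(b, quad, F)
```

Without intermediate rounding the quadratic form and the difference in residual sums of squares coincide, and both lie close to the two hand-computed figures quoted in the text.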
The F statistic from Eq. (5-68) is

$$F = \frac{3.3969/2}{3.1602/11} = 5.91 \quad \text{as before}$$

Example 6-2: Tests of change in the regression slope Example 6-1 showed
how to test the hypothesis

$$H_0\colon \begin{bmatrix} \alpha_1 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \alpha_2 \\ \beta_2 \end{bmatrix}$$

The restricted and unrestricted models are pictured in Fig. 6-1.

Sometimes the investigator is more interested in testing for the homogeneity
of the regression slope, the values of the intercept term being of no
particular importance. The null hypothesis is now specified as

$$H_0\colon \beta_1 = \beta_2 \tag{6-17}$$

The $\alpha$ parameter is free to take on different values in the two subperiods. For
instance, in simple Keynesian theory the size of the national income multiplier
depends only on the marginal propensity to consume $\beta$ and not at all on
the intercept $\alpha$. Thus the $H_0$ in Eq. (6-17) is equivalent to asking whether the
income multiplier is the same in each subperiod. The restricted and unrestricted
models may then be set up as follows:

Restricted:

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & 0 & x_1 \\ 0 & i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} + u$$

Unrestricted:

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & x_1 & 0 & 0 \\ 0 & 0 & i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \beta_1 \\ \alpha_2 \\ \beta_2 \end{bmatrix} + u \tag{6-18}$$

where $i_1$ denotes a column vector of $n_1$ units, $i_2$ a column vector of $n_2$ units,
$x_1$ a column vector of the $n_1$ observations on wartime income, and $x_2$ a
column vector of the $n_2$ observations on peacetime income. OLS may then be
applied directly to each model in Eqs. (6-18) and $H_0$ tested by comparing the
residual sums of squares from the restricted and unrestricted models in the
usual way. The two models are shown in Fig. 6-2.

Figure 6-1 (a) Restricted model; (b) unrestricted model.

The unrestricted model in Eqs. (6-18) is exactly the same as that in Eq.
(6-12), so we already have

$$e'e = 3.1602$$

For the restricted model,

$$X_* = \begin{bmatrix} i_1 & 0 & x_1 \\ 0 & i_2 & x_2 \end{bmatrix}$$

so

$$X_*'X_* = \begin{bmatrix} n_1 & 0 & i_1'x_1 \\ 0 & n_2 & i_2'x_2 \\ x_1'i_1 & x_2'i_2 & x_1'x_1 + x_2'x_2 \end{bmatrix} = \begin{bmatrix} 5 & 0 & 35 \\ 0 & 10 & 110 \\ 35 & 110 & 1865 \end{bmatrix} \qquad X_*'y = \begin{bmatrix} i_1'y_1 \\ i_2'y_2 \\ x_1'y_1 + x_2'y_2 \end{bmatrix} = \begin{bmatrix} 15 \\ 60 \\ 968 \end{bmatrix}$$

Thus

$$e_*'e_* = y'y - y'X_*(X_*'X_*)^{-1}X_*'y$$

$$= (61 + 448) - [15 \quad 60 \quad 968]\begin{bmatrix} 5 & 0 & 35 \\ 0 & 10 & 110 \\ 35 & 110 & 1865 \end{bmatrix}^{-1}\begin{bmatrix} 15 \\ 60 \\ 968 \end{bmatrix}$$

$$= 509 - [15 \quad 60 \quad 968]\,\frac{1}{20{,}500}\begin{bmatrix} 6550 & 3850 & -350 \\ 3850 & 8100 & -550 \\ -350 & -550 & 50 \end{bmatrix}\begin{bmatrix} 15 \\ 60 \\ 968 \end{bmatrix}$$

$$= 509 - 505.5098 = 3.4902$$

Figure 6-2 (a) Restricted model; (b) unrestricted model.

The test statistic for the null hypothesis that $\beta_1 = \beta_2$ is then

$$F = \frac{3.4902 - 3.1602}{3.1602/11} = 1.15$$

and $F_{0.05}(1, 11) = 4.84$. Thus there is no evidence of a significant difference in
the regression slopes in the two periods. Since the F statistic in Example 6-1
was on the borderline of significance, this suggests that any change between
the two periods lies in the intercepts rather than the slopes.

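Before leaving this example, let us note a much simpler way of calculating
$e_*'e_*$, based on deviations from the subperiod means. We have earlier
introduced in Eq. (5-41) the A matrix which transforms a vector of n
observations into deviation form. Let us define $A_1$ and $A_2$ to be such matrices
for use with vectors of $n_1$ and $n_2$ observations, respectively. Premultiplying
the restricted model by the block-diagonal matrix

$$\begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix}$$

then gives

$$\begin{bmatrix} A_1y_1 \\ A_2y_2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & A_1x_1 \\ 0 & 0 & A_2x_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} + \begin{bmatrix} A_1u_1 \\ A_2u_2 \end{bmatrix}$$

since $A_1i_1 = 0$ and $A_2i_2 = 0$. The $\beta$ parameter is thus estimated by a simple
regression of

$$\begin{bmatrix} A_1y_1 \\ A_2y_2 \end{bmatrix} \quad \text{on} \quad \begin{bmatrix} A_1x_1 \\ A_2x_2 \end{bmatrix}$$

Each vector consists of two subvectors, the first being the deviations of the
wartime Y (or X) from the wartime sample mean and the second the
deviations of the peacetime Y (or X) from the corresponding peacetime
sample mean. Denoting the vectors of deviations by $\tilde{y}$ and $\tilde{x}$, respectively,

$$b = \frac{\tilde{x}'\tilde{y}}{\tilde{x}'\tilde{x}}$$

and

$$e_*'e_* = \tilde{y}'\tilde{y} - \tilde{y}'\tilde{x}(\tilde{x}'\tilde{x})^{-1}\tilde{x}'\tilde{y}$$

Computing the deviations for the two subperiods and evaluating these
expressions gives

$$b = \frac{203}{410} = 0.4951$$

and

$$e_*'e_* = 104 - \frac{(203)^2}{410} = 3.4902 \quad \text{as before}$$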
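The shortcut via deviations from subperiod means is easy to verify. The sketch below (Python with NumPy, an assumption of this illustration) demeans each subsample separately, runs the simple regression through the origin on the stacked deviations, and then forms the slope-homogeneity F statistic using the unrestricted residual sum of squares quoted in the text.

```python
import numpy as np

y1 = np.array([1, 2, 2, 4, 6], dtype=float)
x1 = np.array([2, 4, 6, 10, 13], dtype=float)
y2 = np.array([1, 3, 3, 5, 6, 6, 7, 9, 9, 11], dtype=float)
x2 = np.arange(2, 21, 2, dtype=float)

# Deviations from each subperiod's own mean (the effect of A_1 and A_2).
y_dev = np.concatenate([y1 - y1.mean(), y2 - y2.mean()])
x_dev = np.concatenate([x1 - x1.mean(), x2 - x2.mean()])

sxy = x_dev @ y_dev        # sum of cross products of deviations
sxx = x_dev @ x_dev        # sum of squared X deviations
syy = y_dev @ y_dev        # sum of squared Y deviations

b = sxy / sxx                          # common slope estimate
rss_restricted = syy - sxy ** 2 / sxx  # restricted residual sum of squares

rss_unrestricted = 3.1602              # quoted from Example 6-1
F = (rss_restricted - rss_unrestricted) / (rss_unrestricted / 11)

print(sxy, sxx, syy, b, rss_restricted, F)
```

No 3 x 3 inverse is needed: the two intercepts are removed by the within-period demeaning, leaving a one-regressor problem.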


Example 6-3: Testing for structural change in the intercept The null hypothesis
is now

$$H_0\colon \alpha_1 = \alpha_2 \tag{6-19}$$

We must be very careful in the specification of the restricted and unrestricted
models. By analogy with Example 6-2 it might seem reasonable to specify the
restricted model as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & x_1 & 0 \\ i_2 & 0 & x_2 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \end{bmatrix} + u \tag{6-20}$$

with the unrestricted model as before. Model (6-20) imposes a common
intercept but specifies different slopes. If the functions have different slopes,
they must intersect at some X value. There may be cases where it is relevant
and important to test that the intersection occurs at X = 0, as is implied by
specifying Eq. (6-20) as the restricted model. However, this is not usually the
case, and the most common practice is to test $H_0$, subject to the assumption of
a common regression slope. Thus the restricted and unrestricted models
become

Restricted:

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & x_1 \\ i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta \end{bmatrix} + u$$

Unrestricted:

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & 0 & x_1 \\ 0 & i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} + u \tag{6-21}$$

Notice that the unrestricted model in this example is the restricted model of
Example 6-2 [see Eqs. (6-18)], and the restricted model here is the same as the
restricted model in Example 6-1. Thus from our previous calculations the
relevant sums of squares are

$$e_*'e_* = 6.5565$$

and

$$e'e = 3.4902$$

Thus the test statistic for $H_0\colon \alpha_1 = \alpha_2$, conditional on a common $\beta$, is

$$F = \frac{6.5565 - 3.4902}{3.4902/12} = 10.54$$

and $F_{0.01}(1, 12) = 9.33$ so that the difference in the intercepts is significant at
the 1 percent level. The models are shown in Fig. 6-3.


Figure 6-3 (a) Restricted model; (b) unrestricted model.

Figure 6-4


This test is a simple example of the analysis of covariance, which has
widespread applications. Suppose, for instance, that Y denotes the yield of wheat
per acre and X indicates hours of sunshine. One set of observations relates to
strain 1 of wheat and the other set to strain 2. The crucial question is whether one
strain shows a significantly different yield than the other, but suppose that the
experimental plots sown with the two strains have not received equal amounts of
sunshine. The difference between the sample means would then not only reflect
any possible difference between the strains, but also the difference due to the
variable hours of sunshine received. In the analysis of covariance, hours of
sunshine would be termed an intervening variable. The problem is depicted
graphically in Fig. 6-4.

In Fig. 6-4a the single line denotes the assumed positive relationship between
yield and hours of sunshine, and the ellipses denote the samples from each strain.
The difference between the sample means is here due solely to a different set of X
values for the two varieties. In Fig. 6-4b strain 2 is assumed to have a greater yield
than strain 1, a difference indicated by the difference between the intercepts,

$$d = \alpha_2 - \alpha_1$$

The observed difference between the sample means is then d plus or minus any
differential effect due to sunshine. In testing the hypothesis

$$H_0\colon \alpha_1 = \alpha_2$$

proper allowance must be made for any possible interference from the intervening
variable, but this is precisely what is achieved by the test of the models in Eqs.
(6-21), namely, the test of $H_0$, conditional on the assumption of a common $\beta$.


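Summary

These three tests are based on a hierarchy of models:

I. Common regression for both periods

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & x_1 \\ i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta \end{bmatrix} + u$$

II. Differential intercepts, common slope

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & 0 & x_1 \\ 0 & i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} + u$$

III. Differential intercepts, differential slopes

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & x_1 & 0 & 0 \\ 0 & 0 & i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \beta_1 \\ \alpha_2 \\ \beta_2 \end{bmatrix} + u$$

Fitting each model by OLS produces a residual sum of squares. There are
three basic tests on the differences between the various residual sums of squares.
These are the following:

1. Test of differential intercepts: model I contrasted with model II
2. Test of differential slope coefficients: model II contrasted with model III
3. Test of differential regressions (slopes and intercepts): model I contrasted with model III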


The tests outlined above have implicitly assumed that the disturbance
variance $\sigma^2$ is the same in each period. Schmidt and Sickles† have investigated the
effect of departures from this assumption on the significance level of the test. For
equal-sized samples there are modest increases in the true significance level over
the nominal level, even for very large departures from the assumption of equal
variances. For instance, with $n_1 = n_2 = 25$ the true significance level only rises to
0.059, compared with a nominal value of 0.05, when one variance is 100 times the
other. If the X variable is a linear trend, the true significance level rises to 0.063
for a tenfold increase in the variance and to 0.084 for a one-hundredfold increase.
When the sample sizes are unequal, the true significance level shows a greater
departure from the nominal level, and it may now be less or greater than the
nominal level. Full details are given in the reference.


Example 6-4: Tests of structural change (k variables) The previous three
examples have only been concerned with a two-variable model for two
subperiods. The tests need to be generalized in two directions, namely,
extending to k variables and also making comparisons between more than
two subperiods. In this example we make the first extension.

The unrestricted model is now

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + u \tag{6-22}$$

where $X_1$ is of order $n_1 \times k$, $X_2$ is of order $n_2 \times k$, and $\beta_1$ and $\beta_2$ each
denote vectors of k coefficients. This model has exactly the same matrix form
as Eq. (6-12), the only difference being in the number of variables in the
$X_1$, $X_2$ matrices. Let us partition $X_1$ and $X_2$ by the first column of units and
the remaining k - 1 columns of observations on the explanatory variables as
follows:

$$X_1 = [\, i_1 \;\; X_1^* \,] \qquad \text{and} \qquad X_2 = [\, i_2 \;\; X_2^* \,]$$

†P. Schmidt and R. Sickles, “Some Further Evidence on the Use of the Chow Test under
Heteroscedasticity,” Econometrica, 45, 1977, pp. 1293-1298.


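We may then construct the same hierarchy of models as in the two-variable
case. This now gives

$$\text{I:}\quad \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & X_1^* \\ i_2 & X_2^* \end{bmatrix}\begin{bmatrix} \alpha \\ \beta^* \end{bmatrix} + u \qquad \text{common regression for both periods}$$

$$\text{II:}\quad \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & 0 & X_1^* \\ 0 & i_2 & X_2^* \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta^* \end{bmatrix} + u \qquad \text{differential intercepts, common vector of regression slopes}$$

$$\text{III:}\quad \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & 0 & X_1^* & 0 \\ 0 & i_2 & 0 & X_2^* \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta_1^* \\ \beta_2^* \end{bmatrix} + u \qquad \text{differential intercepts, differential slopes}$$

where we have partitioned the k-element $\beta$ vector as

$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} = \begin{bmatrix} \alpha \\ \beta^* \end{bmatrix}$$

Application of OLS to each model will yield a residual sum of squares (RSS)
with an associated number of degrees of freedom as indicated by

Model I      $RSS_1$    $n - k$
Model II     $RSS_2$    $n - k - 1$
Model III    $RSS_3$    $n - 2k$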


where $n = n_1 + n_2$ indicates the total number of observations in the
combined samples. The test statistics for various hypotheses are then as follows:

$H_0\colon \alpha_1 = \alpha_2$: Test of differential intercepts

$$F = \frac{RSS_1 - RSS_2}{RSS_2/(n - k - 1)} \sim F(1,\, n - k - 1) \tag{6-23}$$

$H_0\colon \beta_1^* = \beta_2^*$: Test of differential slope vectors

$$F = \frac{(RSS_2 - RSS_3)/(k - 1)}{RSS_3/(n - 2k)} \sim F(k - 1,\, n - 2k) \tag{6-24}$$

$H_0\colon \beta_1 = \beta_2$: Test of differential regressions (intercepts and slopes)

$$F = \frac{(RSS_1 - RSS_3)/k}{RSS_3/(n - 2k)} \sim F(k,\, n - 2k) \tag{6-25}$$

The degrees of freedom in the numerator are simply obtained as the difference
in the degrees of freedom of the residual sums of squares in the numerator.
This is equal to the number of restrictions involved in going from the
unrestricted to the restricted model. For example, only one restriction is
imposed in going from model II to model I, and k - 1 restrictions (equality
of regression slopes) are imposed in going from model III to model II.

However, a further test is possible in the k > 2 case, which did not arise
in the two-variable model. We may now test whether a subset of coefficients is
stable over the two periods. For example, most wage equations take the form

Wage change = f(market pressure, expectations of inflation)

One might wish to test whether the reaction to the market pressure variables
has changed between two periods, or alternatively, one might hypothesize
that the reaction to inflationary expectations is different in “high” inflation
and “low” inflation periods. The principle of the test is the same as in all the
previous examples, as shown by the following rule.

Fit the restricted model with the subset of coefficients whose stability is
being tested, taking the same value in each subperiod, and compute the
residual sum of squares $e_*'e_*$. All other coefficients are left to vary between
the two subperiods. Then fit the completely unrestricted model, where all
coefficients are free to vary, with the residual sum of squares $e'e$. The test of
the stability of the subset is then based on

$$F = \frac{(e_*'e_* - e'e)/q}{e'e/(n - 2k)}$$

where q indicates the number of coefficients in the subset. Formally the
restricted model is set up as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_{11} & 0 & X_{12} \\ 0 & X_{21} & X_{22} \end{bmatrix}\begin{bmatrix} \beta_{11} \\ \beta_{21} \\ \beta_2 \end{bmatrix} + u \tag{6-26}$$

where $X_{11}$ = $n_1 \times (k - q)$ matrix of observations in period 1 on the variables
not being tested
$X_{21}$ = $n_2 \times (k - q)$ matrix of observations in period 2 on the variables
not being tested
$X_{12}$ = $n_1 \times q$ matrix of observations in period 1 on the variables in the
test
$X_{22}$ = $n_2 \times q$ matrix of observations in period 2 on the variables in the
test
$\beta_{11}$, $\beta_{21}$ = coefficient vectors of $X_{11}$ and $X_{21}$, respectively
$\beta_2$ = common coefficient vector for $X_{12}$ and $X_{22}$
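
The three statistics in Eqs. (6-23) to (6-25) depend only on the three residual sums of squares, the combined sample size n, and the number of coefficients k. The following is a minimal Python sketch of that arithmetic; the function name and the numerical inputs are hypothetical and serve only to exercise the formulas.

```python
def structural_change_f_tests(rss1, rss2, rss3, n, k):
    """Return the F statistics of Eqs. (6-23) to (6-25) with their degrees of freedom.

    rss1, rss2, rss3 are the residual sums of squares of models I, II, and III;
    n is the combined sample size and k the number of coefficients per period.
    """
    f_intercepts = (rss1 - rss2) / (rss2 / (n - k - 1))
    f_slopes = ((rss2 - rss3) / (k - 1)) / (rss3 / (n - 2 * k))
    f_regressions = ((rss1 - rss3) / k) / (rss3 / (n - 2 * k))
    return {
        "intercepts": (f_intercepts, (1, n - k - 1)),
        "slopes": (f_slopes, (k - 1, n - 2 * k)),
        "regressions": (f_regressions, (k, n - 2 * k)),
    }


# Hypothetical residual sums of squares, not taken from any data set in the text.
tests = structural_change_f_tests(rss1=120.0, rss2=100.0, rss3=80.0, n=30, k=3)
for name, (f_value, dof) in tests.items():
    print(f"{name}: F = {f_value:.3f} with {dof} degrees of freedom")
```

Each statistic would then be compared with a critical value from the F distribution with the degrees of freedom returned alongside it.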
Example 6-5: Tests of structural change ($n_2 < k$) A special problem arises if
one of the subperiods has fewer observations than the number of parameters
to be estimated in the model. Let us assume that we have $n_1$ (> k)
observations in one subperiod and $n_2$ (< k) observations in the other. There
is no difficulty about the restricted model in which one set of k parameters is
estimated for the n (= $n_1 + n_2$) sample observations, namely,


$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} b_* + e_*$$

and $e_*'e_*$ has n - k degrees of freedom. If $n_2 = k$, the unrestricted model can
be fitted and will have a residual vector

$$e = \begin{bmatrix} e_1 \\ 0 \end{bmatrix}$$

where $e_1 = y_1 - X_1 b_1$ denotes the residual vector from the first regression, and 0 is a k-element
residual vector from the second regression, in which the regression plane fits
the k observations exactly. The residual sum of squares $e'e$ has

$$n_1 + n_2 - 2k = n_1 - k$$

degrees of freedom. If $n_2 < k$, all k parameters cannot be determined for the
second period, but the residual vector is still 0 since an infinite number of
hyperplanes of dimension k can be passed through a set of fewer than k
observations. Thus the unrestricted residual sum of squares is still

$$e'e = e_1'e_1 \qquad \text{with } n_1 - k \text{ degrees of freedom}$$
Analogy with the previous tests suggests that the appropriate test of the null
hypothesis that the $n_2$ additional observations belong to the same structure as
the first $n_1$ observations is based on†

$$F = \frac{(e_*'e_* - e_1'e_1)/n_2}{e_1'e_1/(n_1 - k)} \tag{6-27}$$

where $n_2$ is given by $(n - k) - (n_1 - k)$, the difference between the number
of degrees of freedom of the sums of squares in the numerator. Thus the
practical procedure is as follows:

* Fit the regression to all $n_1 + n_2$ observations, giving the residual sum of
squares $e_*'e_*$.

* Fit the regression to the $n_1$ observations, giving the residual sum of squares
$e_1'e_1$.

* Compute the F statistic defined in Eq. (6-27) and reject the hypothesis
of a common structure if F exceeds a preselected critical value from
$F(n_2,\, n_1 - k)$.
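
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `ols_rss` and `predictive_chow_f` are hypothetical, the normal equations are solved by plain Gauss-Jordan elimination rather than by a numerical library, and the data are invented (k = 2, $n_1$ = 4, $n_2$ = 1).

```python
def ols_rss(X, y):
    """Residual sum of squares from regressing y on the columns of X (rows = observations)."""
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yt for row, yt in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yt - sum(bi * xi for bi, xi in zip(b, row))) ** 2
               for row, yt in zip(X, y))


def predictive_chow_f(X1, y1, X2, y2):
    """F statistic of Eq. (6-27) for n2 additional observations, where n2 may be below k."""
    n1, n2, k = len(y1), len(y2), len(X1[0])
    rss_all = ols_rss(X1 + X2, y1 + y2)  # step 1: regression on all n1 + n2 observations
    rss_first = ols_rss(X1, y1)          # step 2: regression on the first n1 observations
    f_value = ((rss_all - rss_first) / n2) / (rss_first / (n1 - k))
    return f_value, (n2, n1 - k)


# Hypothetical data: a column of units and one regressor.
X1 = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y1 = [0.0, 1.0, 0.0, 1.0]
X2 = [[1.0, 4.0]]
y2 = [0.0]
f_value, dof = predictive_chow_f(X1, y1, X2, y2)
print(f"F = {f_value:.3f} with {dof} degrees of freedom")
```

Step 3 would compare `f_value` with the chosen critical value of the F distribution with the returned degrees of freedom.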


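† This is only a heuristic proof. For an exact derivation of Eq. (6-27), see F. M. Fisher, “Tests on
Equality between Sets of Coefficients in Two Linear Regressions: An Expository Note,” Econometrica,
38, 1970, pp. 361-366. An alternative proof is given in Sec. 10-1.

Example 6-6: Tests of structural change (k variables, p periods) The final
extension is going from two periods to more than two periods. For example,
we might wish to test whether a Phillips curve has the same structure prior to
World War I, between the two world wars, and post-World War II. But the
test need not be across periods. We might examine the stability of a relation
across countries, industries, social groups, or whatever.

The usual hierarchy of three models may be set up:

$$\text{I:}\quad \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} i_1 & X_1^* \\ i_2 & X_2^* \\ \vdots & \vdots \\ i_p & X_p^* \end{bmatrix}\begin{bmatrix} \alpha \\ \beta^* \end{bmatrix} + u$$

common intercept, common slope vector in all p classes

$$\text{II:}\quad \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} i_1 & 0 & \cdots & 0 & X_1^* \\ 0 & i_2 & \cdots & 0 & X_2^* \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & i_p & X_p^* \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \\ \beta^* \end{bmatrix} + u$$

differential intercepts, common slope vector

$$\text{III:}\quad \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} i_1 & 0 & \cdots & 0 & X_1^* & 0 & \cdots & 0 \\ 0 & i_2 & \cdots & 0 & 0 & X_2^* & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & i_p & 0 & 0 & \cdots & X_p^* \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_p \\ \beta_1^* \\ \vdots \\ \beta_p^* \end{bmatrix} + u$$

differential intercepts, differential slope vectors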


Here $i_j$ is the column vector of $n_j$ units ($j = 1, 2, \ldots, p$) and $X_j^*$ is the
$n_j \times (k - 1)$ matrix of observations on the explanatory variables in class j
($j = 1, 2, \ldots, p$).

The residual sums of squares from the three models have the number of
degrees of freedom

$$n - k, \qquad n - p - k + 1, \qquad \text{and} \qquad n - pk$$

where

$$n = n_1 + n_2 + \cdots + n_p$$

denotes the total number of sample observations. The various hypotheses
may then be tested by contrasting residual sums of squares in the usual way.†

Table 6-1

                 Class 1       Class 2       Class 3       Class 4       All classes
Observation      Y      X      Y      X      Y      X      Y      X      Y      X
1                22     29     30     15     12     16     23     5
2                22     20     32     9      8      31     25     25
3                20     14     26     1      13     26     28     16
4                24     21     25     6      25     35     26     10
5                12     6      37     19     7      12     23     24
Sums             100    90     150    50     65     120    125    80     440    340
Means            20     18     30     10     13     24     25     16     22     17


Example 6-7 This illustration is based on p = 4 classes, but for simplicity of 
calculation we have kept k = 2 (Table 6-1). 

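Denoting the data matrix in model I by $X_*$, the residual sum of squares
for model I is

$$RSS_1 = y'y - y'X_*(X_*'X_*)^{-1}X_*'y = \sum Y^2 - \begin{bmatrix} \sum Y & \sum XY \end{bmatrix}\begin{bmatrix} n & \sum X \\ \sum X & \sum X^2 \end{bmatrix}^{-1}\begin{bmatrix} \sum Y \\ \sum XY \end{bmatrix}$$

where the summations are over all 20 observations,

$$RSS_1 = 10{,}876 - \begin{bmatrix} 440 & 7288 \end{bmatrix}\begin{bmatrix} 20 & 340 \\ 340 & 7462 \end{bmatrix}^{-1}\begin{bmatrix} 440 \\ 7288 \end{bmatrix} = 1174.1$$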


Using the same approach, the residual sum of squares for model II is
given by

$$RSS_2 = \sum Y^2 - \begin{bmatrix} \sum_1 Y & \sum_2 Y & \sum_3 Y & \sum_4 Y & \sum XY \end{bmatrix}\begin{bmatrix} n_1 & 0 & 0 & 0 & \sum_1 X \\ 0 & n_2 & 0 & 0 & \sum_2 X \\ 0 & 0 & n_3 & 0 & \sum_3 X \\ 0 & 0 & 0 & n_4 & \sum_4 X \\ \sum_1 X & \sum_2 X & \sum_3 X & \sum_4 X & \sum X^2 \end{bmatrix}^{-1}\begin{bmatrix} \sum_1 Y \\ \sum_2 Y \\ \sum_3 Y \\ \sum_4 Y \\ \sum XY \end{bmatrix}$$

where $\sum_i$ indicates summation over observations in class i and $\sum$ indicates
summation over all classes,

$$RSS_2 = 10{,}876 - \begin{bmatrix} 100 & 150 & 65 & 125 & 7288 \end{bmatrix}\begin{bmatrix} 5 & 0 & 0 & 0 & 90 \\ 0 & 5 & 0 & 0 & 50 \\ 0 & 0 & 5 & 0 & 120 \\ 0 & 0 & 0 & 5 & 80 \\ 90 & 50 & 120 & 80 & 7462 \end{bmatrix}^{-1}\begin{bmatrix} 100 \\ 150 \\ 65 \\ 125 \\ 7288 \end{bmatrix} = 251.0$$

† It should be emphasized that the tests for structural change discussed in this section assume that
the researcher has strong a priori views about when or where the potential change(s) occurred.
Additional complexities arise when the switch point(s) may be unknown. There is also the possibility
of a transition phase between regimes. There is some discussion of these topics in Sec. 10-4.


As shown in Example 6-2, this sum of squares may be more easily calculated
by first of all expressing the data in the form of deviations from class means,
pooling all the deviations and calculating $RSS_2$ as the residual sum of squares
from the regression of the Y deviations on the X deviations. Table 6-2 shows
the data in deviation form. The residual sum of squares from the regression
of the Y deviations on the X deviations is then

$$RSS_2 = \sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2 - \frac{\left[\sum_i \sum_j (X_{ij} - \bar{X}_i)(Y_{ij} - \bar{Y}_i)\right]^2}{\sum_i \sum_j (X_{ij} - \bar{X}_i)^2}$$

$$= (88 + 94 + 206 + 18) - \frac{(134 + 117 + 177 + 0)^2}{294 + 204 + 382 + 302} = 251.0 \quad \text{as before}$$

Finally, we need to obtain the residual sum of squares from the completely
unrestricted model, model III. This is the sum of the residual sums of
squares obtained by fitting a linear regression to each class separately. These
are most simply obtained from Table 6-2. They are

$$e_1'e_1 = 88 - \frac{(134)^2}{294} = 26.925$$

$$e_2'e_2 = 94 - \frac{(117)^2}{204} = 26.897$$

$$e_3'e_3 = 206 - \frac{(177)^2}{382} = 123.987$$

$$e_4'e_4 = 18 - \frac{0}{302} = 18$$

giving

$$RSS_3 = 195.8$$


The various tests may be set up in an analysis of variance framework as
shown in Table 6-3.

Table 6-3

Model    Residual sum of squares       Degrees of freedom       Mean square
I        $RSS_1$ = 1174.1              n - k = 18
II       $RSS_2$ = 251.0               n - p - k + 1 = 15       16.7
III      $RSS_3$ = 195.8               n - pk = 12              16.3
         $RSS_1 - RSS_2$ = 923.1       3                        307.7
         $RSS_2 - RSS_3$ = 55.2        3                        18.4
         $RSS_1 - RSS_3$ = 978.3       6                        163.0

The test for a common regression slope is

$$F = \frac{(RSS_2 - RSS_3)/3}{RSS_3/12} = \frac{18.4}{16.3} = 1.13$$

which is insignificant. Thus we do not reject the assumption of a common X
effect in all classes. The test for common intercepts (conditional on a common
slope) is

$$F = \frac{(RSS_1 - RSS_2)/3}{RSS_2/15} = \frac{307.7}{16.7} = 18.4$$

and $F_{0.99}(3, 15) = 5.42$. Thus we conclude that a “class effect” is established,
that is, that the levels of the regressions do appear to differ between classes.
The test for overall homogeneity of regressions across classes is

$$F = \frac{(RSS_1 - RSS_3)/6}{RSS_3/12} = \frac{163.0}{16.3} = 10.0$$

and $F_{0.99}(6, 12) = 4.82$, so that this too is a highly significant result, but it
would appear that the significance is due to variation in the intercepts and
not in the slopes.
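
The calculations of Example 6-7 can be retraced with a short Python sketch that works from sums of squares and products about means, as in the deviation approach described above. Two things are assumptions here rather than part of the text: the helper name `deviation_sums`, and the class 2 entry for observation 4, which is taken as Y = 25 because that value agrees with the column sum of 150 shown in Table 6-1.

```python
# Table 6-1 data as (Y, X) pairs per class. Class 2, observation 4 is taken as
# Y = 25, the value that agrees with the column sum of 150 (an assumption).
classes = [
    [(22, 29), (22, 20), (20, 14), (24, 21), (12, 6)],
    [(30, 15), (32, 9), (26, 1), (25, 6), (37, 19)],
    [(12, 16), (8, 31), (13, 26), (25, 35), (7, 12)],
    [(23, 5), (25, 25), (28, 16), (26, 10), (23, 24)],
]


def deviation_sums(pairs):
    """Return (Syy, Sxy, Sxx): sums of squares and products about the sample means."""
    n = len(pairs)
    y_bar = sum(y for y, _ in pairs) / n
    x_bar = sum(x for _, x in pairs) / n
    syy = sum((y - y_bar) ** 2 for y, _ in pairs)
    sxy = sum((y - y_bar) * (x - x_bar) for y, x in pairs)
    sxx = sum((x - x_bar) ** 2 for _, x in pairs)
    return syy, sxy, sxx


pooled = [pair for cls in classes for pair in cls]
within = [deviation_sums(cls) for cls in classes]
n, p, k = len(pooled), len(classes), 2

# Model I: one regression line for all observations.
syy, sxy, sxx = deviation_sums(pooled)
rss1 = syy - sxy ** 2 / sxx
# Model II: class intercepts and a common slope (pooled within-class deviations).
rss2 = sum(w[0] for w in within) - sum(w[1] for w in within) ** 2 / sum(w[2] for w in within)
# Model III: a separate regression line in each class.
rss3 = sum(w[0] - w[1] ** 2 / w[2] for w in within)

f_slopes = ((rss2 - rss3) / ((k - 1) * (p - 1))) / (rss3 / (n - p * k))
f_intercepts = ((rss1 - rss2) / (p - 1)) / (rss2 / (n - p - k + 1))
f_overall = ((rss1 - rss3) / (k * (p - 1))) / (rss3 / (n - p * k))
print(round(rss1, 1), round(rss2, 1), round(rss3, 1))
print(round(f_slopes, 2), round(f_intercepts, 2), round(f_overall, 2))
```

Under that data assumption the three residual sums of squares agree with the entries of Table 6-3 to one decimal place.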


6-3 DUMMY VARIABLES 


Dummy variables have already made their appearance in the previous section, but 
we have not explicitly labeled them as such. For example, the unrestricted model 
in Eqs. (6-21) specifies a consumption function which has different intercepts, but 
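a common slope, in wartime and peacetime periods. The specification is repeated
here:

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & 0 & x_1 \\ 0 & i_2 & x_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} + u \tag{6-28}$$

where the subscript 1 refers to wartime and the subscript 2 to peacetime. This
model may be written as

$$Y_t = \alpha_1 D_{1t} + \alpha_2 D_{2t} + \beta X_t + u_t \qquad t = 1, 2, \ldots, n \tag{6-29}$$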


$D_{1t}$ and $D_{2t}$ are dummy variables whose sample values are given in the first two
columns of the data matrix in Eq. (6-28). That is,

$$D_{1t} = \begin{cases} 1 & \text{if } t \text{ indicates a wartime observation} \\ 0 & \text{if } t \text{ indicates a peacetime observation} \end{cases}$$

and

$$D_{2t} = \begin{cases} 0 & \text{if } t \text{ indicates a wartime observation} \\ 1 & \text{if } t \text{ indicates a peacetime observation} \end{cases}$$

Notice that the model in Eq. (6-29) has no general intercept term. If one runs a
regression of Y on $D_1$, $D_2$, and X with a computer program that automatically
produces an intercept term, the estimation procedure will break down (or possibly
give nonsense coefficients, which are merely ratios of rounding errors) since $D_1$
and $D_2$ sum to the unit vector. The practical alternatives are


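1. Run Eq. (6-29) with the general intercept suppressed.
or
2. Reformulate Eq. (6-29) as

$$Y_t = \gamma_1 + \gamma_2 D_{2t} + \beta X_t + u_t \tag{6-30}$$

and run with the standard OLS program.

Comparing intercepts in Eqs. (6-29) and (6-30) gives

                         Equation (6-29)      Equation (6-30)
Wartime intercept        $\alpha_1$           $\gamma_1$
Peacetime intercept      $\alpha_2$           $\gamma_1 + \gamma_2$

Thus the relation between the $\alpha$'s and the $\gamma$'s is

$$\gamma_1 = \alpha_1 \qquad \text{and} \qquad \gamma_2 = \alpha_2 - \alpha_1$$

The model of Eq. (6-30) may then be put in matrix form as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} i_1 & 0 & x_1 \\ i_2 & i_2 & x_2 \end{bmatrix}\begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \beta \end{bmatrix} + u \tag{6-31}$$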


The choice between the two estimation procedures is of no great importance, but 
it is very important to be clear about precisely what is being tested in either 
model. For instance, testing the significance of $D_2$ in Eq. (6-30) is, in effect, testing
the hypothesis

$$H_0\colon \alpha_2 - \alpha_1 = 0$$

which is testing whether the peacetime and the wartime intercepts are significantly
different, whereas testing the significance of $D_2$ in Eq. (6-29) is asking whether the
peacetime intercept is significantly different from zero.

The dummy variables may also be allowed to interact with the X variable.
Consider

$$Y = \alpha_1 D_1 + \alpha_2 D_2 + \beta_1 (D_1 X) + \beta_2 (D_2 X) + u \tag{6-32}$$

where the subscript t has been omitted for simplicity. Equation (6-32) implies two
separate relations, namely,

$$Y = \alpha_1 + \beta_1 X + u \qquad \text{wartime function}$$

$$Y = \alpha_2 + \beta_2 X + u \qquad \text{peacetime function}$$

Thus performing a single regression of Y on $D_1$, $D_2$, $D_1 X$, and $D_2 X$ with the
general intercept suppressed is equivalent to fitting separate regressions to the two
subperiods. An alternative formulation of Eq. (6-32) is

$$Y = \alpha_1 + (\alpha_2 - \alpha_1) D_2 + \beta_1 X + (\beta_2 - \beta_1) D_2 X + u \tag{6-33}$$

This corresponds to a regression of Y on $D_2$, X, and $D_2 X$ with a constant term.
One advantage of Eq. (6-33) is that testing the significance of the $D_2 X$ variable is
a test of the hypothesis

$$H_0\colon \beta_2 - \beta_1 = 0$$

Thus we see that, in the two-variable model, the tests for homogeneity of
intercepts and homogeneity of slopes are equivalent to tests of the significance of
single coefficients in an appropriately specified regression equation using dummy
variables.
Dummy variables may also be usefully applied in more complex models. For
the data of the last numerical illustration we may specify

$$Y = \alpha_1 + (\alpha_2 - \alpha_1) D_2 + (\alpha_3 - \alpha_1) D_3 + (\alpha_4 - \alpha_1) D_4 + \beta_1 X + (\beta_2 - \beta_1) D_2 X + (\beta_3 - \beta_1) D_3 X + (\beta_4 - \beta_1) D_4 X + u \tag{6-34}$$

where

$$D_i = \begin{cases} 1 & \text{for an observation in class } i \\ 0 & \text{otherwise} \end{cases} \qquad i = 2, 3, 4$$

Equation (6-34) allows intercepts and regression slopes to vary across all four
classes. The choice of the class not to be represented by a dummy variable is
arbitrary, but once it is made, the coefficients of all the other variables are
differences from the coefficients of that class.
Estimating Eq. (6-34) from the data of Table 6-1 gives

$$\hat{Y} = \underset{(2.56)}{11.7959} + \underset{(2.19)}{12.4688}\,D_2 - \underset{(-1.41)}{9.9163}\,D_3 + \underset{(2.13)}{13.2041}\,D_4 + \underset{(1.19)}{0.4558}\,X + \underset{(0.32)}{0.1177}\,D_2 X + \underset{(0.02)}{0.0076}\,D_3 X - \underset{(-1.38)}{0.4558}\,D_4 X$$

with $R^2 = 0.7468$ and 12 degrees of freedom.† The figures in parentheses are t
ratios, and we see that none of the coefficients of the DX variables is significantly
different from zero. This, of course, confirms the homogeneity of regression slopes
established earlier by the F test. Imposing the assumption of a common regression
slope gives the revised regression

$$\hat{Y} = \underset{(4.79)}{13.4882} + \underset{(4.68)}{12.8969}\,D_2 - \underset{(-3.42)}{9.1726}\,D_3 + \underset{(2.20)}{5.742}\,D_4 + \underset{(3.04)}{0.3621}\,X$$

with $R^2 = 0.7341$ and 15 degrees of freedom. All three dummies are significantly
different from zero at the 5 percent level, thus establishing that the intercepts in
the second, third, and fourth classes are different from the intercept in the first
class, again in agreement with the earlier F test on intercepts. One advantage of
this type of dummy variable setup is that in cases where the tests examine the
joint significance of a subset of variables the dummy variables can indicate which
variables may have made the most important contribution to the overall
significance of the group.

†I am indebted to G. Gujarati for discussions of this point and also for the calculations.
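
Because Eq. (6-34) gives every class its own intercept and slope, its coefficients can be recovered from four separate two-variable regressions: the class 1 line supplies the constant and the X coefficient, and the dummy coefficients are differences from class 1. The Python sketch below illustrates this under two assumptions: the helper name `fit_line` is hypothetical, and the class 2 entry for observation 4 of Table 6-1 is taken as Y = 25, the value that agrees with the column sum of 150.

```python
# Table 6-1 data as (Y, X) pairs per class (class 2, observation 4 taken as Y = 25).
classes = [
    [(22, 29), (22, 20), (20, 14), (24, 21), (12, 6)],
    [(30, 15), (32, 9), (26, 1), (25, 6), (37, 19)],
    [(12, 16), (8, 31), (13, 26), (25, 35), (7, 12)],
    [(23, 5), (25, 25), (28, 16), (26, 10), (23, 24)],
]


def fit_line(pairs):
    """OLS intercept and slope for a two-variable regression of Y on X."""
    n = len(pairs)
    y_bar = sum(y for y, _ in pairs) / n
    x_bar = sum(x for _, x in pairs) / n
    slope = (sum((y - y_bar) * (x - x_bar) for y, x in pairs)
             / sum((x - x_bar) ** 2 for _, x in pairs))
    return y_bar - slope * x_bar, slope


lines = [fit_line(cls) for cls in classes]
base_intercept, base_slope = lines[0]
# Coefficients of D2, D3, D4 and of D2X, D3X, D4X: differences from class 1.
intercept_shifts = [a - base_intercept for a, _ in lines[1:]]
slope_shifts = [b - base_slope for _, b in lines[1:]]

print(round(base_intercept, 4), [round(v, 4) for v in intercept_shifts])
print(round(base_slope, 4), [round(v, 4) for v in slope_shifts])
```

The sketch covers only the point estimates; the t ratios would additionally require the estimated disturbance variance from the unrestricted regression.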

The dummy variables specified above play an important role in describing 
temporal effects (where the classes refer to different time periods), spatial effects 
(where the classes refer to different regions or countries), industrial effects (where 
the classes refer to industries), and so forth. Suppose we have qualitative variables 
such as 


* Education (none, grammar, some high school, high school diploma, some
college, college degree, advanced degree, foreign education)

* Marital status (unmarried, married 1 year, 2 years, 3 years, 4 years, 5-9 years,
10-20 years, over 20 years)

* Sex (male, female)

* Race (white, black, other)


Only the last two are truly qualitative variables. Education might be treated as a 
cardinal variable, measured by years of formal education, and likewise, duration 
of marriage is a cardinal variable. In both cases, however, we may use groupings 
of a cardinal variable to define a qualitative variable. If a qualitative variable is 
thought to influence some dependent variable, we may use the categories of that 
variable to classify the sample observations into various classes, and the preceding 
method of analysis applies. There are, however, some slight complications if we 
wish to use two or more qualitative variables in a single equation. 


Two or More Sets of Dummy Variables 


Suppose we wish to incorporate two qualitative variables in a regression equation, 
each such variable being represented by a set of dummy variables. To be specific, 
suppose the variables are educational level (3 classes) and sex (2 classes). We then 
define 


$$E_i = \begin{cases} 1 & \text{if observation relates to education level } i, \quad i = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases}$$

and

$$S_j = \begin{cases} 1 & \text{if observation relates to sex } j, \quad j = 1, 2 \\ 0 & \text{otherwise} \end{cases}$$


Suppose we then wish to examine the relationship between hours spent in reading 
nonfiction Y and these two qualitative variables. It is instructive to examine first 
of all what happens if we have only one set of dummy variables in the model. A 
linear model for the influence of E on Y would be written 


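$$Y = \alpha_1 E_1 + \alpha_2 E_2 + \alpha_3 E_3 + u \tag{6-35}$$

The data matrix is

$$E = \begin{bmatrix} i_1 & 0 & 0 \\ 0 & i_2 & 0 \\ 0 & 0 & i_3 \end{bmatrix}$$

so that

$$E'E = \begin{bmatrix} n_1 & 0 & 0 \\ 0 & n_2 & 0 \\ 0 & 0 & n_3 \end{bmatrix}$$

and the estimated OLS vector is

$$\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} \bar{Y}_1 \\ \bar{Y}_2 \\ \bar{Y}_3 \end{bmatrix}$$

so that the OLS regression coefficients are simply the mean values of Y in each of
the educational classes. If we used the alternative formulation

$$Y = \alpha_1 + \alpha_2^* E_2 + \alpha_3^* E_3 + u \tag{6-36}$$

consistency with Eq. (6-35) requires

$$\alpha_2^* = \alpha_2 - \alpha_1 \qquad \text{and} \qquad \alpha_3^* = \alpha_3 - \alpha_1$$

The data matrix is now

$$E^* = \begin{bmatrix} i_1 & 0 & 0 \\ i_2 & i_2 & 0 \\ i_3 & 0 & i_3 \end{bmatrix}$$

and it is simple to show that the OLS coefficient vector is†

$$\begin{bmatrix} a_1 \\ a_2^* \\ a_3^* \end{bmatrix} = \begin{bmatrix} \bar{Y}_1 \\ \bar{Y}_2 - \bar{Y}_1 \\ \bar{Y}_3 - \bar{Y}_1 \end{bmatrix}$$

† See Problem 6-7.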
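
The two parameterizations in Eqs. (6-35) and (6-36) can be compared numerically. The Python sketch below is illustrative only: the helper `ols` (normal equations solved by Gauss-Jordan elimination) and the small data set are hypothetical, not taken from the text.

```python
def ols(X, y):
    """OLS coefficients from the normal equations, solved by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yt for r, yt in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical observations on Y for three education classes.
groups = {1: [2.0, 4.0], 2: [5.0, 7.0, 9.0], 3: [10.0]}
y = [v for lev in (1, 2, 3) for v in groups[lev]]
levels = [lev for lev in (1, 2, 3) for _ in groups[lev]]

# Eq. (6-35): three dummies and no general intercept.
X_a = [[float(lev == 1), float(lev == 2), float(lev == 3)] for lev in levels]
# Eq. (6-36): general intercept plus dummies for classes 2 and 3.
X_b = [[1.0, float(lev == 2), float(lev == 3)] for lev in levels]

a = ols(X_a, y)       # class means
a_star = ols(X_b, y)  # class 1 mean, then differences from it
print(a, a_star)
```

With these data the class means are 3, 7, and 10, so the second formulation should return 3 followed by the differences 4 and 7.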
Table 6-7 Estimated mean hours

           $E_1$      $E_2$      $E_3$
$S_1$      11.14      14.45      23.68
$S_2$      3.79       7.10       16.33

hours than $E_1$, and $E_3$ shows 12.54 more than $E_1$, irrespective of whether we are
in the $S_1$ row or the $S_2$ row. This implies the absence of any interaction between
the two factors. If, however, it is to be expected that the differential sex effect
varies with the level of education, then an interaction effect exists, and we need to
see how to incorporate it into the model and estimate it.

Returning to Eq. (6-38), we would now expand the relation to read 


$$Y = \mu + \alpha_2 E_2 + \alpha_3 E_3 + \beta_2 S_2 + \gamma_2 (E_2 S_2) + \gamma_3 (E_3 S_2) + u \tag{6-40}$$


There are only two possible interaction variables in this case, and they are found 
by multiplying each E level by each S level. The conditional expected values are 
now shown in Table 6-8. 

The first row is the same as in Table 6-5, but the second row incorporates the 
interaction effects. Thus the sex differential is 


$\beta_2$ for $E_1$
$\beta_2 + \gamma_2$ for $E_2$
$\beta_2 + \gamma_3$ for $E_3$

Likewise, the $E_2/E_1$ differential is

$\alpha_2$ for $S_1$
$\alpha_2 + \gamma_2$ for $S_2$

and the $E_3/E_1$ differential is

$\alpha_3$ for $S_1$
$\alpha_3 + \gamma_3$ for $S_2$

Table 6-8 $E(Y \mid E_i, S_j)$

           $E_1$                $E_2$                                      $E_3$
$S_1$      $\mu$                $\mu + \alpha_2$                           $\mu + \alpha_3$
$S_2$      $\mu + \beta_2$      $\mu + \alpha_2 + \beta_2 + \gamma_2$      $\mu + \alpha_3 + \beta_2 + \gamma_3$

Table 6-9 Estimated mean hours

           $E_1$      $E_2$      $E_3$
$S_1$      13.33      13.00      20.00
$S_2$      0.50       10.00      20.00


Referring back to the data matrix in Eq. (6-39), the data matrix for this problem
would now be
Ex Bay cSgte BgSzy\EsSe 


X= 
1 
1 
1 isd 
11 1 
Thus 
10 3° ara pied 17 
9 Se io ual ono 36 
PM ash Flo wb pe NG 
at chal MRT aie 
1 “1 40P ei en 10 
Lies O npelistee Lian Cal 20 


The OLS equation is now 
$$\hat{Y} = 13.33 - 0.33 E_2 + 6.67 E_3 - 12.83 S_2 + 9.83 (E_2 S_2) + 12.83 (E_3 S_2)$$


and substitution in Table 6-8 gives the estimated number of mean hours shown in 
Table 6-9. 
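
The substitution into Table 6-8 is simple enough to sketch in Python. The coefficient values are those of the OLS equation above; the dictionary keys and the function name `cell_mean` are hypothetical labels introduced for this illustration.

```python
# Coefficients of the fitted equation with interaction terms.
coef = {"const": 13.33, "E2": -0.33, "E3": 6.67, "S2": -12.83,
        "E2S2": 9.83, "E3S2": 12.83}


def cell_mean(education, sex):
    """Estimated mean hours for one (education level, sex) cell of Table 6-8."""
    e2, e3, s2 = int(education == 2), int(education == 3), int(sex == 2)
    return (coef["const"] + coef["E2"] * e2 + coef["E3"] * e3 + coef["S2"] * s2
            + coef["E2S2"] * e2 * s2 + coef["E3S2"] * e3 * s2)


# Rows are indexed by sex, columns by education level, as in Table 6-9.
table = {(s, e): round(cell_mean(e, s), 2) for s in (1, 2) for e in (1, 2, 3)}
# Sex differential at each education level (second row minus first row).
sex_differentials = [round(table[(2, e)] - table[(1, e)], 2) for e in (1, 2, 3)]
print(table)
print(sex_differentials)
```

The differentials computed in the last step correspond to the three sex effects listed before Table 6-8, evaluated at the estimated coefficients.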

Compared with the previous regression, where no interaction effect was 
incorporated, we now have a large negative sex effect (-12.83) at $E_1$, which is
reduced to -3.00 at $E_2$ and eliminated completely at $E_3$. This last result is an
automatic consequence of our data, where in the interests of simplicity we had 
only one observation in each of the E, cells and also in the E, S, cell. The 
regression values, with interaction, then coincide with these observations. This has 
also distorted the estimate of the $E_2$ differential effect to give a small negative
number (-0.33) for $S_1$, but the calculations do illustrate the principles involved.†


† This section has only dealt with dummy variables on the right-hand side of the equation. For a
discussion of the application of dummy variables to the left-hand-side variable see Sec. 10-5,
Qualitative Dependent Variables.

6-4 SEASONAL ADJUSTMENT


Dummy variables also play an important role in problems of seasonal adjustment.
These problems are of two kinds. First there is the conventional and long-standing
problem of deseasonalizing a given quarterly or monthly time series, and
second there is the problem of estimating an econometric relationship between
variables that are available in both unadjusted and deseasonalized forms.

Suppose we have 4n quarterly observations on a variable Y, such as
unemployment, imports, or food prices. Such variables are likely to display a
pronounced seasonal movement, and for purposes of economic intelligence and
policy it is important to produce a “deseasonalized” series, from which one can
better assess whether unemployment, say, is really increasing or decreasing. There
are several methods of deseasonalizing series in practice, but here we are only
concerned with applications of dummy variables.

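Let us define a $4n \times 4$ matrix D,

$$D = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

This is the sample matrix for four dummy variables defined by

$$D_{it} = \begin{cases} 1 & \text{if } t \text{ occurs in quarter } i \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, 3, 4$$

If we regress y on D, we obtain

$$y = Db + y^* \tag{6-41}$$

where b is the vector of the OLS coefficients and $y^*$ the vector of residuals. From
the analysis of Chap. 5,

$$y^* = My \tag{6-42}$$

where

$$M = I - D(D'D)^{-1}D' \tag{6-43}$$

and M is symmetric idempotent with the property

$$MD = 0 \tag{6-44}$$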
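
Equations (6-41) to (6-44) can be illustrated without any matrix algebra, because D'D is diagonal with n in every position and D'y holds the quarterly sums, so that b consists of quarterly means. The Python sketch below uses a hypothetical series covering three years; the variable names are also assumptions made for this illustration.

```python
# Hypothetical quarterly series covering n = 3 years (4n = 12 observations).
y = [10.0, 14.0, 7.0, 12.0, 11.0, 15.0, 9.0, 13.0, 12.0, 16.0, 8.0, 14.0]

n_years = len(y) // 4
# OLS coefficients of y on the four seasonal dummies: the quarterly means.
b = [sum(y[q::4]) / n_years for q in range(4)]
# y* = My: each observation minus the mean of its own quarter.
y_star = [value - b[t % 4] for t, value in enumerate(y)]
# The residuals are orthogonal to every dummy, so they sum to zero in each quarter.
quarter_sums = [sum(y_star[q::4]) for q in range(4)]

print(b)
print(y_star)
print(quarter_sums)
```

The zero sums in each quarter show that the residual series is orthogonal to every column of D, which is the sample counterpart of MD = 0.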


The series $y^*$ cannot serve directly as a deseasonalized series for two reasons.
First of all, it sums to zero, and it would seem plausible to require a
deseasonalized series to have the same sum as the original, unadjusted series. Second, as
shown earlier for model (6-35),

$$b = \begin{bmatrix} \bar{Y}_1 \\ \bar{Y}_2 \\ \bar{Y}_3 \\ \bar{Y}_4 \end{bmatrix}$$

where $\bar{Y}_i$ (i = 1, ..., 4) is the mean of all ith-quarter Y values. Thus $y^*$ merely
consists of deviations of the Y values from the quarterly means. But if the series
contains trend and/or cyclical components, the elements of b will be an amalgam
of trend, cyclical, and seasonal effects. Thus subtracting b year by year from the
actual Y values will not yield satisfactory estimates of a deseasonalized series. The
remedy is to introduce into the regression a polynomial in time of sufficiently high
order to represent the trend and cyclical components, so that the coefficients of D
will be a more satisfactory estimate of the seasonal component. Thus one
computes the regression

$$y = Pa + Db + e \tag{6-45}$$

where

$$P = \begin{bmatrix} 1 & 1^2 & \cdots & 1^p \\ 2 & 2^2 & \cdots & 2^p \\ 3 & 3^2 & \cdots & 3^p \\ 4 & 4^2 & \cdots & 4^p \\ \vdots & \vdots & & \vdots \\ 4n & (4n)^2 & \cdots & (4n)^p \end{bmatrix}$$

The deseasonalized series would now be defined as

$$y^* = y - Db \tag{6-46}$$


Jorgenson has argued that if the P and D matrices are properly specified, then a
and b will be best linear unbiased estimates of the systematic and seasonal
components, since Eq. (6-45) is then a straightforward example of ordinary least
squares.† The estimates of a and b are given by

$$\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} P'P & P'D \\ D'P & D'D \end{bmatrix}^{-1}\begin{bmatrix} P'y \\ D'y \end{bmatrix} \tag{6-47}$$

Applying the results for the inverse of a partitioned matrix,

$$b = (D'ND)^{-1}D'Ny \tag{6-48}$$

where

$$N = I - P(P'P)^{-1}P' \tag{6-49}$$

†D. W. Jorgenson, “Minimum Variance, Linear, Unbiased Seasonal Adjustment in Economic
Time Series,” Journal of the American Statistical Association, 59, 1964, pp. 681-725.

Table 6-10 Quarterly seasonal component of the U.K. Index of Industrial Production, 1948-1957

                                     Seasonal component
Method                               $b_1$     $b_2$     $b_3$     $b_4$
Moving average (additive)            3.28      0.77      -7.13     3.08
Regression on D                      1.85      0.35      -7.15     4.95
Regression on [P D] (p = 4)          4.87      0.36      -TAT      2.95
Regression on [P D] (p = 6)          3.35      0.95      -7.54     3.25

Substituting in Eq. (6-46) gives

$$y^* = Ty$$

where

$$T = I - D(D'ND)^{-1}D'N \tag{6-50}$$

Thus the deseasonalized series can still be expressed as a linear transformation of
y. However, in contrast with the M matrix defined in Eq. (6-43), the T matrix is
not symmetric, though it is idempotent and does satisfy the condition TD = 0.

As a numerical illustration of these methods we made several estimates of the 
quarterly seasonal component in the U.K. Index of Industrial Production for the 
period 1948-1957. The results are shown in Table 6-10. The centered four-quarter 
moving average is a flexible method for removing trend and cycle, and we will 
take the estimates of the seasonal component in the first row of Table 6-10 as a 
standard by which to judge the various regressions. It is seen that the simple 
regression on seasonal dummies alone gives misleading estimates of the seasonal 
component, apart from the pronounced dip in the third quarter which is well 
picked up by all methods, and it is only when we use a sixth-degree polynomial 
that the results agree closely with those obtained from the moving average 
method. 


Estimation of Econometric Relationships 


Faced with the choice between using raw data or seasonally adjusted data, one
should think carefully about the basic decision process underlying any behavioral
relation being estimated. For example, in the study of production decisions it is
often found that firms attempt to base production rates on “smoothed” sales
figures, so that the appropriate regression might be actual production on
deseasonalized sales. A salaried worker may have an income with no seasonal
component, but consumption expenditures with a strong seasonal component due to
vacation and Christmas spending. The appropriate model would then regress
actual consumption on actual income plus a set of seasonal dummies. Income
itself may have one seasonal pattern and consumption a different seasonal pattern,
with deseasonalized consumption a function of deseasonalized income. If one
then wished to explain actual consumption, the appropriate regression is

Actual consumption = deseasonalized consumption + seasonal component
                   = f(deseasonalized income) + seasonal component
                   = f(deseasonalized income, dummy variables)


In many cases theory may give no clear guide to the appropriate regression, 
and as the data are often available in both unadjusted and deseasonalized form, it 
is sometimes difficult to decide in which form to incorporate variables in the 
regression. In practice, however, the problem of specification turns out to be less 
important than might have been expected, because of an important set of results 
due to Lovell.† To illustrate one of Lovell’s basic results, consider the least-squares
regression

$$y = Xc_1 + Db_1 + e_1 \tag{6-51}$$

This may be interpreted as a regression of unadjusted Y values on unadjusted X
values and a set of seasonal dummies. However, it is more instructive to consider
a more general specification first of all, and simply regard X and D as a
partitioning of the set of explanatory variables in the regression model,

$$y = \begin{bmatrix} X & D \end{bmatrix}\begin{bmatrix} c_1 \\ b_1 \end{bmatrix} + e_1$$

The OLS coefficients are then given by

$$\begin{bmatrix} c_1 \\ b_1 \end{bmatrix} = \begin{bmatrix} X'X & X'D \\ D'X & D'D \end{bmatrix}^{-1}\begin{bmatrix} X'y \\ D'y \end{bmatrix} \tag{6-52}$$

Applying Eq. (4-68), the first element in this inverse matrix is

$$\left(X'X - X'D(D'D)^{-1}D'X\right)^{-1} = (X'MX)^{-1}$$

where

$$M = I - D(D'D)^{-1}D'$$

This is the M matrix already defined in Eq. (6-43), which we know to be
symmetric and idempotent and to have the property MD = 0. The remaining
element in the first row of the inverse matrix is then

$$-(X'MX)^{-1}X'D(D'D)^{-1}$$

Equation (6-52) may then be solved for $c_1$ as

$$c_1 = (X'MX)^{-1}X'y - (X'MX)^{-1}X'D(D'D)^{-1}D'y$$

that is,

$$c_1 = (X'MX)^{-1}X'My \tag{6-53}$$


† M. C. Lovell, “Seasonal Adjustment of Economic Time Series,” Journal of the American 
Statistical Association, 58, 1963, pp. 993-1010. The basic result goes back to R. Frisch and F. V. 
Waugh, “Partial Time Regressions as Compared with Individual Trends,” Econometrica, 1, 1933, pp. 
387-401. 

Now consider the transformed variables 

$$y^* = My \quad \text{and} \quad X^* = MX \tag{6-54}$$
From Eq. (6-54) it follows that y* is the vector of residuals after y has been 
regressed on D. Similarly, each column in X* is the vector of residuals after the 
corresponding X variable has been regressed on D. If y* is regressed on X*, the 
estimated coefficient vector is 

$$(X'M'MX)^{-1}X'M'My = (X'MX)^{-1}X'My = c_1$$
in view of the symmetry and idempotency of M. Thus we have the important 
result that if we partition the explanatory variables in a regression into two blocks 
denoted by 
$$[X \quad D]$$
the estimated coefficients of the X variables are exactly the same, whether we run 
the full OLS regression of y on X and D or first “correct” y and X for the effect of 
D and regress the y residuals on the X residuals. More formally, if we calculate 
the two regressions 
$$y = Xc_1 + Db_1 + e_1$$
$$y^* = X^*c_2 + e_2$$
the result is 
$$c_1 = c_2 \tag{6-55}$$

This result is, of course, symmetrical with respect to X and D, and the D matrix 
need not consist of dummy variables; it is merely any subset of explanatory 
variables. However, Lovell is concerned with seasonal adjustment, and D is then 
appropriately an n × 4 matrix of quarterly seasonal dummies. 
Two further basic results from Lovell are that the regressions 


$$y = X^*c_3 + e_3$$
and 
$$y = X^*c_4 + Db_4 + e_4$$
also yield identical vectors of coefficients for the X variables, that is, 
$$c_1 = c_2 = c_3 = c_4 \tag{6-56}$$


The proofs are simple. Regressing y on X* gives 
$$c_3 = (X'MX)^{-1}X'My = c_1$$
and regressing y on [X* D] gives 
$$\begin{bmatrix} c_4 \\ b_4 \end{bmatrix} = \begin{bmatrix} X'MX & X'MD \\ D'MX & D'D \end{bmatrix}^{-1}\begin{bmatrix} X'My \\ D'y \end{bmatrix} = \begin{bmatrix} X'MX & 0 \\ 0 & D'D \end{bmatrix}^{-1}\begin{bmatrix} X'My \\ D'y \end{bmatrix} \quad \text{using Eq. (6-44)}$$
Thus 
$$c_4 = (X'MX)^{-1}X'My = c_1$$
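The equalities in Eqs. (6-55) and (6-56) are easily checked numerically. The following Python sketch is illustrative only: the sample size, the two regressors, the coefficient values, and the helper `ols` are assumptions made for the demonstration and are not taken from Lovell.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40                                    # ten years of quarterly data (assumed)

# D: n x 4 matrix of quarterly seasonal dummies
D = np.tile(np.eye(4), (n // 4, 1))
# X: two made-up regressors, each with its own seasonal pattern
seasonal = np.array([[1.0, -2.0], [0.5, 0.0], [-1.0, 1.0], [2.0, 3.0]])
X = rng.normal(size=(n, 2)) + D @ seasonal
y = X @ np.array([1.5, -0.7]) + D @ np.array([3.0, 1.0, 4.0, 2.0]) + rng.normal(size=n)

def ols(A, b):
    """Least-squares coefficients from regressing b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

M = np.eye(n) - D @ np.linalg.solve(D.T @ D, D.T)   # M = I - D(D'D)^{-1}D'
y_star, X_star = M @ y, M @ X                       # residuals from regressions on D

c1 = ols(np.hstack([X, D]), y)[:2]        # y  on [X  D]
c2 = ols(X_star, y_star)                  # y* on X*
c3 = ols(X_star, y)                       # y  on X*
c4 = ols(np.hstack([X_star, D]), y)[:2]   # y  on [X* D]
print(c1, c2, c3, c4)                     # four copies of the same vector
```

The agreement is exact (up to rounding error) only because M is symmetric and idempotent with MD = 0, which is the property the proofs rely on.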


These results raise some further questions. We have already seen that if D is 
merely a matrix of seasonal dummies, then y* and X*, defined in Eqs. (6-54), are 
not properly deseasonalized series. On the other hand, if properly deseasonalized 
series are obtained by using the transformation matrix T defined in Eq. (6-50), 
this matrix, though idempotent and orthogonal to D, does not have the symmetry 
property used in the above proofs. Furthermore, many official series are not 
deseasonalized by least-squares methods at all, but by moving average or other 
methods. Thus the Lovell results cannot be expected to hold exactly when y* and 
X* indicate properly deseasonalized series. Nonetheless some experimental calcu- 
lations with various equations from the Oxford econometric model of the United 
Kingdom indicate agreement to several decimal places between estimated coeffi- 
cients, whether the regression has been run with raw data and dummy variables or 
with deseasonalized variables produced by moving average methods or by 
least-squares regressions on D or on [P D].† The years covered by the model showed 
fairly steady growth and negligible cyclical oscillations. One would not expect 
such close agreement if the cyclical effects were very strong, and in practical work 
one should not allow this theorem to be a substitute for careful thought about the 
proper specification of the relationship. 


6-5 MULTICOLLINEARITY 


We have seen in Chap. 5 that the OLS estimator is 
$$b = (X'X)^{-1}X'y$$
and that its variance matrix is 
$$\operatorname{var}(b) = \sigma^2(X'X)^{-1}$$


Thus the sampling variances depend not only on the disturbance variance $\sigma^2$, but 
also on the sample values of the explanatory variables. Consider the following 
hypothetical matrices. 


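$$\begin{array}{cccc}
\text{Case} & X'X & (X'X)^{-1} & |X'X| \\[6pt]
1 & \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} & \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} & 1 \\[12pt]
2 & \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix} & \begin{bmatrix} 5.26 & -4.74 \\ -4.74 & 5.26 \end{bmatrix} & 0.19 \\[12pt]
3 & \begin{bmatrix} 1 & 0.99 \\ 0.99 & 1 \end{bmatrix} & \begin{bmatrix} 50 & -49.5 \\ -49.5 & 50 \end{bmatrix} & 0.02
\end{array}$$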


† A. Georgopoulou and J. Johnston, “Seasonal Adjustment of Economic Time Series,” University 
of Manchester, discussion paper. 

In case 1 the two explanatory variables are orthogonal and the coefficients of the 
X’s in the multiple regression equation would be the same as those given by the 
simple regressions of Y on each X in turn.† Orthogonal variables may be set up in 
experimental designs, but they are the exception, not the rule, in economic data. 
Cases 2 and 3 display increasing correlation between the two explanatory 
variables, as evidenced by the increasing numerical value for the off-diagonal 
(covariation) term. This is also reflected in the dramatic fall in the value of the 
determinant. This is described as a situation of collinearity (or multicollinearity) 
between the explanatory variables. Three important effects are illustrated in the 
sequence of matrices: 


1. The sampling variances of the estimated OLS coefficients increase sharply 
with increasing collinearity between the explanatory variables. Taking case 1 
as the base, they are more than five times as great in case 2 and 50 times as 
great in case 3. Thus in any specific application individual coefficients are 
likely to differ substantially from their true values. 
2. Greater covariances between the explanatory variables produce greater 
sampling covariances for the OLS coefficients. Comparing the off-diagonal terms 
in X’X and $(X'X)^{-1}$ shows that a positive covariance for the X’s gives a 
negative covariance for the b’s, and vice versa. Again, in a specific 
application, if $b_2$ is below $\beta_2$, $b_3$ is most likely to exceed $\beta_3$, and vice versa (provided 
the X’s are positively correlated). 
3. Small variations in the data (for instance, dropping or adding a few 
observations) may produce substantial variations in the OLS coefficients. Suppose the 
normal equations for case 2 are 

$$\begin{aligned} b_2 + 0.9b_3 &= 2.8 \\ 0.9b_2 + b_3 &= 2.9 \end{aligned}$$

with solution $b_2 = 1$, $b_3 = 2$. Now suppose the $X_3$ variable is somewhat more highly 
correlated with $X_2$ and we have normal equations for case 3 as 

$$\begin{aligned} b_2 + 0.99b_3 &= 2.8 \\ 0.99b_2 + b_3 &= 2.9 \end{aligned}$$

with solution $b_2 = -3.57$, $b_3 = 6.43$. 

† Assuming the variables to be in deviation form 
$$b = \begin{bmatrix} x_2'x_2 & 0 \\ 0 & x_3'x_3 \end{bmatrix}^{-1}\begin{bmatrix} x_2'y \\ x_3'y \end{bmatrix} = \begin{bmatrix} x_2'y / x_2'x_2 \\ x_3'y / x_3'x_3 \end{bmatrix}$$
where each element in the right-hand-side vector is the slope coefficient in a simple two-variable 
regression. 

The only numerical change between the two sets of equations is a 10 percent 
(or less) increase in two coefficients, yet the solution values change 
dramatically.† Notice, however, that $b_2 + b_3 = 3$ in the first case and $b_2 + b_3 = 2.864$ 
in the second. Thus the sum of the coefficients appears to be estimated fairly 
precisely, even though the individual coefficients are subject to large errors of 
estimation. Even this happy result is dependent on the covariance between 
the b’s being negative (i.e., covariance between the X’s being positive) for 
$$\operatorname{var}(b_2 + b_3) = \operatorname{var}(b_2) + \operatorname{var}(b_3) + 2\operatorname{cov}(b_2, b_3)$$


Thus increasing collinearity increases both $\operatorname{var}(b_2)$ and $\operatorname{var}(b_3)$. However, it 
also increases the numerical value of $\operatorname{cov}(b_2, b_3)$ and, provided this covariance 
is negative, $\operatorname{var}(b_2 + b_3)$ may not increase at all. For example, case 2 gives 


$$\operatorname{var}(b_2 + b_3) = 2(5.26 - 4.74)\sigma^2 = 1.04\sigma^2$$
and case 3 gives 
$$\operatorname{var}(b_2 + b_3) = 2(50 - 49.5)\sigma^2 = 1.00\sigma^2$$
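The three effects can be reproduced with a few lines of Python. This is only a sketch: the helper `case` and the variable names are ours, and the right-hand sides 2.8 and 2.9 are the hypothetical values used in the normal equations above.

```python
import numpy as np

def case(r):
    """X'X and its inverse for two unit-variance regressors with covariation r."""
    XtX = np.array([[1.0, r], [r, 1.0]])
    return XtX, np.linalg.inv(XtX)

for r in (0.0, 0.9, 0.99):
    XtX, inv = case(r)
    # var(b2 + b3)/sigma^2 = var(b2) + var(b3) + 2 cov(b2, b3),
    # which is the sum of all four elements of the inverse
    print(r, round(np.linalg.det(XtX), 4), inv.round(2), round(inv.sum(), 3))

# Sensitivity of the solution of the normal equations to the off-diagonal term
rhs = np.array([2.8, 2.9])
solutions = {r: np.linalg.solve(case(r)[0], rhs) for r in (0.9, 0.99)}
for r, b in solutions.items():
    print(r, b.round(3), round(b.sum(), 3))
```

Carried to full precision, the inverse in case 3 has elements 50.25 and −49.75, and the variance of the sum works out at about 1.05σ² and 1.005σ² in cases 2 and 3. The small differences from the figures quoted above come from rounding the determinant and the inverse.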


For simplicity these three important points have been illustrated for the case 
of two explanatory variables. It is important to establish that similar results hold 
for the k-variable case and to discuss how multicollinearity may be detected and 
what may be done about it. However, before doing that, we will discuss the 
limiting case of exact, or complete, multicollinearity. 


Exact Multicollinearity and Estimable Functions 
In the case of two explanatory variables exact collinearity is represented by 
$$x_3 = ax_2 \tag{6-57}$$


where we are working with the variables in deviation form. Then 
$$X'X = \sum x_2^2 \begin{bmatrix} 1 & a \\ a & a^2 \end{bmatrix}$$


with $|X'X| = 0$ and $\rho(X'X) = 1$. This is simply a breakdown of the assumption 
that X has full column rank, and so we cannot obtain the unique OLS vector 
defined by 


$$b = \begin{bmatrix} b_2 \\ b_3 \end{bmatrix} = (X'X)^{-1}X'y$$
The normal equations 
$$(X'X)b = X'y \tag{6-58}$$
however, will admit an infinity of solutions for 


$$X'y = \sum x_2y \begin{bmatrix} 1 \\ a \end{bmatrix}$$


so that the rows of X’y exhibit the same linear dependence as the rows of X’X. 
The set of equations in Eq. (6-58) is thus consistent, and there is an infinity of 
solution vectors. Taking the first equation in Eq. (6-58), we have 
$$\sum x_2^2\,(b_2 + ab_3) = \sum x_2y$$
and the second equation is 
$$a\sum x_2^2\,(b_2 + ab_3) = a\sum x_2y$$
Both equations reduce to 
$$b_2 + ab_3 = \frac{\sum x_2y}{\sum x_2^2} \tag{6-59}$$
Thus no matter which arbitrary solution to Eqs. (6-58) we take, the linear 
combination $b_2 + ab_3$ will always have the same numerical value. We then define 
$\beta_2 + a\beta_3$ as an estimable function, where we notice that the a in the estimable 
function is the parameter defining the linear dependence between $x_2$ and $x_3$. 
The same result may be derived by writing the model in deviation form as 
$$y = \beta_2x_2 + \beta_3x_3 + (u - \bar{u})$$
and substituting Eq. (6-57) to get 
$$y = \beta x_2 + (u - \bar{u}) \tag{6-60}$$
where 
$$\beta = \beta_2 + a\beta_3 \tag{6-61}$$
The $\beta$ parameter may be estimated by applying OLS to Eq. (6-60) to give 
$$\hat{\beta} = \frac{\sum x_2y}{\sum x_2^2} \tag{6-62}$$


which is the same expression as that already obtained in Eq. (6-59). The expected 
value of y for a given $x_2$ (and $x_3$) is 


$$E(y) = \beta_2x_2 + \beta_3x_3 = (\beta_2 + a\beta_3)x_2 = \beta x_2$$


Thus E(y) can be estimated uniquely since $\beta$ can be estimated uniquely by Eq. 
(6-62). 

† This is only a hypothetical example, but the literature of applied econometrics is full of examples 
of small changes in the data base producing substantial changes in estimated coefficients. For one 
example, see J. Johnston, “An Econometric Model of the United Kingdom,” Review of Economic 
Studies, 29, 1961, pp. 29-39. 


Example 6-9 Suppose $x_3 = 2x_2$ and the sample data are 


$$X'X = \begin{bmatrix} 10 & 20 \\ 20 & 40 \end{bmatrix} \qquad X'y = \begin{bmatrix} 5 \\ 10 \end{bmatrix}$$


The normal equations (6-58) are 
$$\begin{aligned} 10b_2 + 20b_3 &= 5 \\ 20b_2 + 40b_3 &= 10 \end{aligned}$$
with solution 
$$b_2 = \frac{5 - 20b_3}{10}$$


Taking some arbitrary values for $b_3$ gives Table 6-11. 

Table 6-11 

  b_3      b_2      b_2 + 2b_3     Estimate of E(y | x_2 = 20) 
    0      0.5      0.5            10 
    1     -1.5      0.5            10 
   -1      2.5      0.5            10 
  -10     20.5      0.5            10 

The linear combination $b_2 + 2b_3$ is invariant to the solution chosen for 
the normal equations, and it is readily seen to be equal to 


$$\frac{\sum x_2y}{\sum x_2^2} = \frac{5}{10} = 0.5$$


Likewise, the regression value for any given $x_2$ is invariant to the choice of 
solution vector for the normal equations. 
To summarize, even though $\beta_2$ and $\beta_3$ cannot be estimated, a certain 
linear combination of $\beta_2$ and $\beta_3$ can be estimated, and E(y) can also be 
estimated for any given $x_2$ value. 
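The arithmetic of Example 6-9 can be replayed in Python. The sketch below uses only the X’X and X’y values of the example; the list `rows` and the extra minimum-norm solution `b_min_norm` are our own additions for illustration.

```python
import numpy as np

XtX = np.array([[10.0, 20.0], [20.0, 40.0]])   # X'X when x3 = 2 * x2
Xty = np.array([5.0, 10.0])                    # X'y

rows = []
for b3 in (0.0, 1.0, -1.0, -10.0):             # arbitrary choices, as in Table 6-11
    b2 = (5.0 - 20.0 * b3) / 10.0              # from the first normal equation
    b = np.array([b2, b3])
    assert np.allclose(XtX @ b, Xty)           # b satisfies both normal equations
    combo = b2 + 2.0 * b3                      # the estimable combination
    rows.append((b3, b2, combo, combo * 20.0)) # last entry: fitted value at x2 = 20
    print(rows[-1])

# One more solution of the normal equations: the minimum-norm one
b_min_norm = np.linalg.pinv(XtX, rcond=1e-10) @ Xty
print(b_min_norm, b_min_norm[0] + 2.0 * b_min_norm[1])
```

Every solution, including the minimum-norm one, gives 0.5 for the combination and 10 for the fitted value at $x_2 = 20$, although the individual b’s differ widely.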


The nature of the problem may also be illustrated geometrically. In Fig. 6-5a 
the standard OLS case is shown. The $x_2$, $x_3$ vectors are not perfectly collinear, 
and they span a two-dimensional subspace in $E^n$. Dropping a perpendicular from 
y to that subspace splits y into 
$$y = \hat{y} + e$$
where 
$$\hat{y} = Xb = b_2x_2 + b_3x_3$$

[Figure 6-5: panels (a) and (b)] 

The regression vector $\hat{y}$ is a unique linear combination of the column vectors 
$x_2$, $x_3$. By contrast in Fig. 6-5b the $x_2$, $x_3$ vectors only span a one-dimensional 
subspace (line) in $E^n$. The $\hat{y}$ vector is still unambiguously determined by dropping 
a perpendicular from y to the line, but $\hat{y}$ cannot be expressed uniquely in terms of 
$x_2$ and $x_3$. 

In the general case perfect multicollinearity exists if $\rho(X) < k$. Suppose 
$\rho(X) = r$. There is then at least one set of r linearly independent columns in X. 
Let one such set be assembled in the first r columns, so that we partition X as 


$$X = [X_1 \quad X_2] \tag{6-63}$$
where $X_1$ = n × r matrix of rank r 
$X_2$ = n × s matrix of the s = k − r remaining columns in X 


Each column vector in $X_2$ may then be expressed as a linear combination of 
the columns of $X_1$. Thus we may write 


$$X_2 = X_1W \tag{6-64}$$
where W is an r × s matrix, each column of which gives the coefficients of the 
linear combination for the corresponding vector in $X_2$. The numerical values of 
the elements of W can, in principle, be determined. Combining Eqs. (6-63) and 
(6-64) gives 


$$X = X_1[I_r \quad W] = X_1Z \tag{6-65}$$
where 
$$Z = [I_r \quad W] \tag{6-66}$$
The linear model may then be written 
$$y = X\beta + u = X_1Z\beta + u = X_1\beta_1 + u \tag{6-67}$$
where 
$$\beta_1 = Z\beta \tag{6-68}$$


Note carefully that $\beta_1$ indicates a vector of r linear combinations of the original $\beta$’s, 
the coefficients of those linear combinations being given by the rows of Z, as 
defined in Eq. (6-66). The elements of $\beta_1$ may be estimated by applying OLS to 
Eq. (6-67), since the $X_1$ matrix has full column rank. The estimator is thus 

$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y \tag{6-69}$$
Likewise, E(y) can be estimated by the regression vector 


$$\hat{y} = X_1\hat{\beta}_1 \tag{6-70}$$
and the estimates in Eqs. (6-69) and (6-70) will have the usual OLS properties. 
The operational procedure would be to identify the largest submatrix in X with 
full column rank, denote it by $X_1$, and substitute in Eqs. (6-69) and (6-70). If 
there is more than one such submatrix, $\hat{y}$ will be invariant to which is chosen. 

As in the case of two explanatory variables, an alternative procedure is to 
derive any solution $b_0$ to the normal equations 

$$(X'X)b_0 = X'y$$
and compute 
$$\hat{y} = Xb_0 \tag{6-71}$$

The numerical values for $\hat{y}$ in Eqs. (6-70) and (6-71) will be identical. One needs 
to determine the linear dependencies in the X data (as in the W and Z matrices) 
in order to determine the precise linear combinations of $\beta$ coefficients that are 
being estimated in $\hat{\beta}_1$, but such combinations are not usually of any economic 
significance. Furthermore, the use of Eq. (6-70) or Eq. (6-71) for forecasting 
outside the sample observations rests on the same linear dependencies holding 
among the X’s in the forecast period. 
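A small simulation may help to fix ideas about the general case. Everything in the following Python sketch is assumed for illustration: the dimensions, the W matrix, the coefficient vector, and the use of the minimum-norm least-squares solution as the arbitrary solution of the normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
X1 = rng.normal(size=(n, r))                          # r linearly independent columns
W = np.array([[1.0, 0.0], [2.0, -1.0], [0.0, 0.5]])   # r x s, made-up dependencies
X2 = X1 @ W                                           # Eq. (6-64)
X = np.hstack([X1, X2])                               # n x k with rank r < k
beta = np.array([1.0, -2.0, 0.5, 3.0, 1.0])
y = X @ beta + rng.normal(size=n)

Z = np.hstack([np.eye(r), W])                         # Eq. (6-66)
beta1_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)      # Eq. (6-69)
y_hat_a = X1 @ beta1_hat                              # Eq. (6-70)

# Any solution of the normal equations will do; lstsq returns the minimum-norm one
b0 = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat_b = X @ b0                                      # Eq. (6-71)

print(np.linalg.matrix_rank(X))                       # rank of X
print(np.abs(y_hat_a - y_hat_b).max())                # difference between the two fits
print(Z @ b0 - beta1_hat)                             # Z b0 against the OLS estimate
```

The two regression vectors coincide, and applying Z to the arbitrary solution recovers the estimate from Eq. (6-69). This is the sense in which only the r combinations given by the rows of Z are estimable.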


Near Multicollinearity 


The prevalent case in so much econometric work, especially with time series data, 
is one of high but not exact multicollinearity. This raises three questions: 


1. What effects to expect from multicollinearity 
2. How to detect the degree of multicollinearity 
3. What remedial action to take 


Effects 


Provided the X matrix has full column rank, the OLS estimates exist and will still 
be the best linear unbiased estimates. This property, however, is now cold comfort 
since the sampling variances of the estimates increase alarmingly with rising 
collinearity. To prove this in the general case, partition the X matrix as 


$$X = [x_i \quad X_i]$$


where $x_i$ = column vector of observations on the ith explanatory variable 
$X_i$ = submatrix of observations on the k − 1 other explanatory variables 


Then 
$$X'X = \begin{bmatrix} x_i'x_i & x_i'X_i \\ X_i'x_i & X_i'X_i \end{bmatrix}$$
Applying Eq. (4-68), the leading term in $(X'X)^{-1}$ is 


$$[x_i'x_i - x_i'X_i(X_i'X_i)^{-1}X_i'x_i]^{-1} = (x_i'M_ix_i)^{-1}$$
where 
$$M_i = I - X_i(X_i'X_i)^{-1}X_i'$$
Thus the sampling variance of the OLS estimate of $\beta_i$ is 
$$\operatorname{var}(b_i) = \frac{\sigma^2}{x_i'M_ix_i} \tag{6-72}$$


But from Eq. (5-54) it is seen that 


$x_i'M_ix_i$ = residual sum of squares from the regression of the ith explanatory 
variable on the other k − 1 explanatory variables 


The residual sum of squares decreases with increasing collinearity between the ith 
explanatory variable and the remaining explanatory variables, and thus the 
sampling variance of $b_i$ increases. It is clear from Eq. (6-72) that not all 
coefficients will be affected similarly by collinearity. The denominators in the k 
sampling variances are the residual sums of squares from the multiple regressions 
of each explanatory variable in turn on all the other explanatory variables, and 
these can vary considerably from one to another, as is illustrated in the following 
numerical examples. 

Suppose we have three explanatory variables $X_2$, $X_3$, and $X_4$, all measured in 
deviation form. We show four illustrative X’X matrices, and in each case the 
determinant, the inverse, and the values of the squared multiple correlation 
coefficients obtained when each explanatory variable in turn is regressed on the 
remaining explanatory variables. 


$$\begin{array}{ccccccc}
\text{Case} & X'X & \det & (X'X)^{-1} & R^2_{2.34} & R^2_{3.24} & R^2_{4.23} \\[6pt]
1 & \begin{bmatrix} 10 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 1 \end{bmatrix} & 50 & \begin{bmatrix} 0.1 & 0 & 0 \\ 0 & 0.2 & 0 \\ 0 & 0 & 1.0 \end{bmatrix} & 0 & 0 & 0 \\[18pt]
2 & \begin{bmatrix} 10 & 5 & 2 \\ 5 & 5 & 2 \\ 2 & 2 & 1 \end{bmatrix} & 5 & \begin{bmatrix} 0.2 & -0.2 & 0 \\ -0.2 & 1.2 & -2.0 \\ 0 & -2.0 & 5.0 \end{bmatrix} & 0.5000 & 0.8333 & 0.8000 \\[18pt]
3 & \begin{bmatrix} 10 & 6 & 3 \\ 6 & 5 & 2 \\ 3 & 2 & 1 \end{bmatrix} & 1 & \begin{bmatrix} 1.0 & 0 & -3.0 \\ 0 & 1.0 & -2.0 \\ -3.0 & -2.0 & 14.0 \end{bmatrix} & 0.9000 & 0.8000 & 0.9286 \\[18pt]
4 & \begin{bmatrix} 10 & 7 & 1.5 \\ 7 & 5 & 1 \\ 1.5 & 1 & 1 \end{bmatrix} & 0.75 & \begin{bmatrix} 5.3 & -7.3 & -0.7 \\ -7.3 & 10.3 & 0.7 \\ -0.7 & 0.7 & 1.3 \end{bmatrix} & 0.9812 & 0.9806 & 0.2500
\end{array}$$


Case 1 shows perfectly orthogonal variables. The sampling variances are given by 
$\sigma^2$ times the elements in the principal diagonal of $(X'X)^{-1}$. In the orthogonal case 
these variances are inversely proportional to the amount of variation in the 
corresponding explanatory variable. For example, $X_2$ displays 10 times the 
variation of $X_4$, and the sampling variance of its coefficient is one-tenth of that 
for the coefficient of $X_4$. To standardize the comparisons, the elements in the 
principal diagonal of X’X have been kept constant throughout, but the cases 
display increasing collinearity, as reflected in the declining value of the 
determinant. In case 2 the sampling variances are all larger than in case 1, but they still 
retain the same order in that 


$$\operatorname{var}(b_2) < \operatorname{var}(b_3) < \operatorname{var}(b_4)$$


However, they have been increased by varying factors. We notice that $\operatorname{var}(b_4)$ is 
now 25 times $\operatorname{var}(b_2)$, contrasted with 10 times in the orthogonal case. Case 3 
shows a large increase in $\operatorname{var}(b_2)$, which is now as large as $\operatorname{var}(b_3)$. Case 4 is the 
most interesting of all in that $\operatorname{var}(b_4)$ is now the smallest of the three sampling 
variances, and indeed it is not much larger than in the orthogonal case, whereas 
$\operatorname{var}(b_2)$ and $\operatorname{var}(b_3)$ are each more than 50 times as large as in the orthogonal 
case. 
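The entries for the four cases can be recomputed directly. In the Python sketch below the matrices are entered as we read them from the illustration (diagonal elements 10, 5, and 1 throughout); the dictionary `cases` and the other names are ours, and the off-diagonal elements of case 4 in particular should be treated as an assumption.

```python
import numpy as np

cases = {
    1: [[10, 0, 0], [0, 5, 0], [0, 0, 1]],
    2: [[10, 5, 2], [5, 5, 2], [2, 2, 1]],
    3: [[10, 6, 3], [6, 5, 2], [3, 2, 1]],
    4: [[10, 7, 1.5], [7, 5, 1], [1.5, 1, 1]],
}

results = {}
for label, rows in cases.items():
    XtX = np.array(rows, dtype=float)
    inv = np.linalg.inv(XtX)
    tss = np.diag(XtX)                       # total sums of squares of X2, X3, X4
    # RSS_i is the reciprocal of the ith diagonal element of the inverse,
    # so R_i^2 = 1 - RSS_i / TSS_i = 1 - 1 / (TSS_i * inverse_ii)
    r2 = 1.0 - 1.0 / (tss * np.diag(inv))
    results[label] = (np.linalg.det(XtX), np.diag(inv), r2)
    print(label, round(results[label][0], 2), np.diag(inv).round(2), r2.round(4))
```

The diagonal of each inverse, multiplied by σ², gives the three sampling variances discussed in the text, so the ratios between cases can be read off the printed output.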

Careful study of the $R^2$’s will show that there is an association between the 
size of $R^2$ and the extent to which the corresponding sampling variance is 
increased over the orthogonal case. The relationship is, in fact, a precise one and 
may be set out as follows: Let 


$\mathrm{TSS}_i$ = total sum of squared deviations for $X_i$ 
$\mathrm{RSS}_i$ = residual sum of squares when $X_i$ is regressed on the other k − 1 
explanatory variables 
$R_i^2$ = square of multiple correlation coefficient from the same regression 
$= 1 - \mathrm{RSS}_i/\mathrm{TSS}_i$ 
Then Eq. (6-72) may be rewritten as 


$$\operatorname{var}(b_i) = \frac{\sigma^2}{\mathrm{RSS}_i} = \frac{\sigma^2}{\mathrm{TSS}_i(1 - R_i^2)}$$


Letting $b_{i0}$ denote the estimate of $\beta_i$ in the orthogonal case, 


$$\operatorname{var}(b_{i0}) = \frac{\sigma^2}{\mathrm{TSS}_i}$$


Thus, if $\mathrm{TSS}_i$ is held constant, the magnification of the sampling variance with 
increasing collinearity is given by 
$$\frac{\operatorname{var}(b_i)}{\operatorname{var}(b_{i0})} = \frac{1}{1 - R_i^2} \tag{6-73}$$


The orthogonal case is not meant to be a feasible target, but is used as a 
benchmark from which to measure the relative magnification of the sampling 
variance of different coefficients. Some illustrative calculations from Eq. (6-73) are 
shown in Table 6-12, and the graph of the function is drawn in Fig. 6-6. 

Table 6-12 Magnification of sampling variances 

R_i^2                  0.5   0.8   0.9   0.95   0.96   0.97   0.98   0.99   0.999 
var(b_i)/var(b_i0)     2     5     10    20     25     33     50     100    1000 

As the table and the figure show, the relationship is highly nonlinear, and the 
magnification factor increases dramatically as $R^2$ exceeds 0.9. The formula also 
reveals why different coefficients fare differently in a regression. For example, in 
case 4 the magnification effect was very serious for $b_2$ and $b_3$ and almost negligible 
for $b_4$, which is exactly in line with the pattern of $R^2$’s. In that data $X_2$ and $X_3$ are 
highly correlated with one another, but $X_4$ is not closely correlated with either $X_2$ 
or $X_3$, or with any linear combination of them. 
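The magnification factor of Eq. (6-73) can be obtained either from the auxiliary regressions or directly from the inverse matrix. The Python sketch below checks that the two routes agree; the simulated design (a near copy of one regressor plus one unrelated regressor) and the function name `magnification` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x2 = rng.normal(size=n)
x3 = x2 + 0.1 * rng.normal(size=n)       # nearly collinear with x2 (made-up design)
x4 = rng.normal(size=n)                  # unrelated to the other two
X = np.column_stack([x2, x3, x4])
X = X - X.mean(axis=0)                   # deviation form

def magnification(X, i):
    """1/(1 - R_i^2) from the auxiliary regression of column i on the other columns."""
    xi = X[:, i]
    others = np.delete(X, i, axis=1)
    resid = xi - others @ np.linalg.lstsq(others, xi, rcond=None)[0]
    r2 = 1.0 - (resid @ resid) / (xi @ xi)
    return 1.0 / (1.0 - r2)

factors = np.array([magnification(X, i) for i in range(X.shape[1])])
# Same quantity from Eq. (6-72): TSS_i times the ith diagonal element of (X'X)^{-1}
direct = np.diag(X.T @ X) * np.diag(np.linalg.inv(X.T @ X))
print(factors.round(2), direct.round(2))
```

In this design the first two coefficients carry large magnification factors while the third stays close to unity, the same qualitative pattern as case 4.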

The three main effects already listed for the case of two explanatory variables 
will thus carry over to the general case, namely 


• Very large sampling variances 
• Greater covariances 
• Great sensitivity of estimated coefficients to small data changes 

[Figure 6-6: graph of var(b_i)/var(b_i0) against R_i^2] 

A common result is to find regressions possibly with a very high overall $R^2$, but 
with some (or many) individual coefficients apparently insignificant. The high $R^2$ 
arises when the y vector is close to the hyperplane generated by the $x_i$ vectors and 
the apparently insignificant coefficients arise because the $x_i$ vectors are nearly 
linearly dependent. It is also possible to find a high $R^2$ and highly significant t 
values on individual coefficients, even though multicollinearity is serious. This can 
arise if individual coefficients happen to be numerically well in excess of the true 
value, so that the effect still shows up in spite of the inflated standard error, 
and/or because the true value itself is so large that even an estimate on the 
downside still shows up as significant. The multicollinearity would likely show up 
in varying parameter estimates as some sample observations are dropped or 
added. For any regression, however, comparison of the $R_i^2$’s shows which 
coefficients are likely to be most seriously affected by collinearity. 


Detection 


Computer programs often print out |X’X|. As our numerical examples illustrate, 
the determinant declines in value with increasing collinearity, tending to zero as 
collinearity becomes exact. While this is a useful warning signal, we have no calibration 
scale for assessing what is serious and what is very serious, and again it gives no 
guide to the relative effects on individual coefficients. Similar remarks apply to the 
computation of the eigenvalues of X’X. Since 


$$|X'X| = \lambda_1\lambda_2 \cdots \lambda_k$$
a small determinant means that some (or many) of the eigenvalues will be small. 
But again knowledge of the eigenvalues is of little direct help in assessing effects 
on individual coefficients.† 
The most useful single diagnostic guide is the $R_i^2$’s, as shown above. In a 
sense $\mathrm{TSS}_i$ determines the minimum sampling variance that might be achieved for 
$b_i$ in that in the orthogonal case 


$$\operatorname{var}(b_i) = \frac{\sigma^2}{\mathrm{TSS}_i}$$


Any collinearity in the sample data will raise all sampling variances, but the 
relative magnifications for different coefficients will be indicated by a comparison 
of the $R_i^2$’s. 

Belsley, Kuh, and Welsch suggest the combined use of two diagnostic tools to 
detect which coefficients are most likely to be affected by the collinearity.‡ The 
first statistic is the condition number of the X matrix, defined by 


$$\kappa(X) = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}$$


where $\lambda_{\max}$ and $\lambda_{\min}$ denote the maximum and minimum eigenvalues of X’X, 
respectively. If the X matrix has been standardized so that each column has unit 
length, then $\kappa(X)$ is unity when the columns of X are orthogonal and rises above 
unity with collinearity between the columns. Various applications with 
experimental and actual data sets suggest that condition numbers in the range of 20 to 
30 are probably indicative of serious collinearity problems, and a fortiori for 
numbers in excess of that range. A condition index may be computed for each 
eigenvalue, starting at unity for $\lambda_i = \lambda_{\max}$ and rising to $\kappa(X)$ for $\lambda_i = \lambda_{\min}$. Thus 
a given data matrix may yield one or more condition indexes in excess of a 
“danger” level. 

† The precise relationship between $\operatorname{var}(b_i)$ and the $\lambda$’s is derived below. 
‡ D. A. Belsley, E. Kuh, and R. E. Welsch, Regression Diagnostics: Identifying Influential Data and 
Sources of Collinearity, Wiley, New York, 1980, chap. 3. 

The second and related diagnostic tool is the regression coefficient 
variance decomposition. If X is n × k and V is the k × k matrix that diagonalizes 
X’X, then 
$$(X'X)V = V\Lambda$$
where $\Lambda$ is the diagonal matrix of the eigenvalues of X’X. Thus 


$$\operatorname{var}(b) = \sigma^2(X'X)^{-1} = \sigma^2V\Lambda^{-1}V'$$


and 
$$\operatorname{var}(b_i) = \sigma^2\left(\frac{v_{i1}^2}{\lambda_1} + \frac{v_{i2}^2}{\lambda_2} + \cdots + \frac{v_{ik}^2}{\lambda_k}\right)$$
where $v_{i1}, v_{i2}, \ldots, v_{ik}$ are the elements of the ith row of V. From this one may 
compute the proportions of $\operatorname{var}(b_i)$ associated with each $\lambda$. The two-step procedure 
recommended by Belsley, Kuh, and Welsch is 


1. Compute the $\lambda_i$’s and identify any $\lambda_i$ which gives a condition index in excess 
of the “danger” level (say, 20 to 30). 

2. For each of those selected $\lambda_i$’s inspect the proportions of the sampling 
variance of each $b_j$ associated with that eigenvalue. Coefficients with 
proportions in excess of, say, 0.50 are likely to have been adversely affected by the 
collinearity in the X matrix. Reference should be made to Belsley, Kuh, and 
Welsch for detailed examples of the technique. 
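The two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Belsley, Kuh, and Welsch program: the function name `collinearity_diagnostics`, the simulated data, and the layout of the proportions (one row per coefficient, one column per eigenvalue) are our own choices, and the cutoffs 20 and 0.5 are the indicative figures quoted above.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Condition indexes and variance-decomposition proportions for the columns of X."""
    Xs = X / np.linalg.norm(X, axis=0)            # scale each column to unit length
    lam, V = np.linalg.eigh(Xs.T @ Xs)            # (X'X)V = V Lambda, eigenvalues ascending
    cond_index = np.sqrt(lam.max() / lam)         # largest entry is the condition number
    phi = V**2 / lam                              # phi[i, j] = v_ij^2 / lambda_j
    props = phi / phi.sum(axis=1, keepdims=True)  # share of var(b_i) associated with lambda_j
    return lam, cond_index, props

# Made-up data: x3 is almost a copy of x2, x4 is unrelated to both.
rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + 0.01 * rng.normal(size=n)
x4 = rng.normal(size=n)
lam, cond_index, props = collinearity_diagnostics(np.column_stack([x2, x3, x4]))

for j in np.flatnonzero(cond_index > 20):         # step 1: eigenvalues past the danger level
    flagged = np.flatnonzero(props[:, j] > 0.5)   # step 2: coefficients with large proportions
    print(round(cond_index[j], 1), flagged)
```

With these data a single eigenvalue exceeds the danger level, and the flagged positions are those of the two nearly collinear regressors.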


Remedies 


More data are no help in multicollinearity if they are simply “more of the same.” What 
matters is the structure of the X’X matrix, and this will only be improved by 
adding data which are less collinear than before. However, there is often no easy 
way for an econometrician to get better data. The data are produced by the 
functioning of the economic system, and the collinearities reflect the nature of 
that system. One hopeful approach in some areas is the joint use of both 
time-series and cross-section data, which we will take up in Chap. 10. A related 
approach is to feed in estimates of some parameters which may be available from 
an independent, relevant study. The classic example is the analysis of demand 
functions, where an estimate of the income elasticity obtained from cross-section 
studies is fed into the estimation of the price elasticities from a time-series sample. 
This is a sequential approach using one set of cross-section data and another set of 
time-series data, rather than a joint (or simultaneous) set. The latter requires that 
the observations relate to a common set of decision units. 

The general framework for the incorporation of prior estimates of some 
parameters may be set out as follows. Partition X, $\beta$, and b as 


$$X = [X_1 \quad X_2] \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} \qquad b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$


where $X_1$ is the n × r submatrix consisting of the first r columns of X, $X_2$ is the 
submatrix consisting of the remaining s = k − r columns, and $\beta$ and b are 
partitioned conformably. Suppose that a previous study provides the estimated $b_1$ 
vector with an estimated variance matrix $V_1$. We will assume $b_1$ to be unbiased, 
that is, 

$$E(b_1) = \beta_1$$
and we will take $V_1$ to be approximately the true variance matrix 


$$E\{(b_1 - \beta_1)(b_1 - \beta_1)'\}$$


The problem now is to estimate the remaining unknown parameters in $\beta_2$. The 
procedure is to “correct” y for the $X_1$ data by forming 


$$y_* = y - X_1b_1 \tag{6-74}$$
and then perform an OLS regression of $y_*$ on $X_2$. The result is 
$$b_2 = (X_2'X_2)^{-1}X_2'y_* \tag{6-75}$$


Writing 
$$y = X_1\beta_1 + X_2\beta_2 + u$$


and substituting in Eqs. (6-74) and (6-75) gives 
$$b_2 = \beta_2 + (X_2'X_2)^{-1}X_2'u - (X_2'X_2)^{-1}X_2'X_1(b_1 - \beta_1)$$
Taking expectations 
$$E(b_2) = \beta_2$$
since E(u) = 0 and $E(b_1) = \beta_1$.† Then 
$$\operatorname{var}(b_2) = E\{(b_2 - \beta_2)(b_2 - \beta_2)'\} = \sigma^2(X_2'X_2)^{-1} + (X_2'X_2)^{-1}X_2'X_1V_1X_1'X_2(X_2'X_2)^{-1} \tag{6-76}$$


on the assumption that the two sets of data are independent. The first term in Eq. 
(6-76) is the conventional variance matrix for an OLS regression involving $X_2$, 
and the second term shows the elements by which this must be adjusted because 
of the sampling variation in the $b_1$ coefficients, used in calculating $y_*$. The only 
remaining practical problem is the estimation of $\sigma^2$. Defining 
$$e = y_* - X_2b_2 = y - X_1b_1 - X_2b_2$$


the estimate is e’e/(n − k), where we divide by n − k rather than by n − r since 
e depends on k estimated parameters. 

† Notice that this operation involves taking expectations over two different sets of data. E(u) refers 
to expectations over the current sample data and $E(b_1)$ to expectations over the data underlying the 
prior estimate $b_1$. 
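The sequential procedure of Eqs. (6-74) to (6-76) translates almost line for line into code. The Python sketch below is an illustration under assumed inputs: the function name `estimate_with_prior`, the simulated data, and the prior variance of 0.01 are all hypothetical, and the divisor n − k follows the text.

```python
import numpy as np

def estimate_with_prior(y, X1, X2, b1, V1):
    """OLS of y* = y - X1 b1 on X2, with the variance matrix of Eq. (6-76)."""
    n, k = X1.shape[0], X1.shape[1] + X2.shape[1]
    y_star = y - X1 @ b1                          # Eq. (6-74)
    A = np.linalg.solve(X2.T @ X2, X2.T)          # (X2'X2)^{-1} X2'
    b2 = A @ y_star                               # Eq. (6-75)
    e = y_star - X2 @ b2
    s2 = (e @ e) / (n - k)                        # estimate of sigma^2, divisor n - k
    var_b2 = s2 * np.linalg.inv(X2.T @ X2) + A @ X1 @ V1 @ X1.T @ A.T   # Eq. (6-76)
    return b2, var_b2, s2

# Made-up data: one coefficient taken from an earlier study, two estimated here.
rng = np.random.default_rng(4)
n = 60
X1 = rng.normal(size=(n, 1))
X2 = np.column_stack([0.9 * X1[:, 0] + 0.3 * rng.normal(size=n), rng.normal(size=n)])
beta1, beta2 = np.array([0.8]), np.array([-1.2, 0.5])
y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=n)
V1 = np.array([[0.01]])                           # variance reported by the earlier study
b1 = beta1 + 0.1 * rng.normal(size=1)             # one draw of the prior estimate
b2, var_b2, s2 = estimate_with_prior(y, X1, X2, b1, V1)
print(b2.round(3), np.sqrt(np.diag(var_b2)).round(3))
```

Setting $V_1$ to a zero matrix reduces the result to the conventional OLS variance matrix, which shows how much of each standard error is due to uncertainty in the prior estimate.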

Some authors suggest dealing with multicollinearity in a rather mechanical 
and purely numerical fashion. For example, a currently fashionable technique is 
that of ridge regression.† The ridge estimate of $\beta$ is defined as 


$$b_R = (X'X + cI)^{-1}X'y \tag{6-77}$$


where c > 0 is an arbitrary constant. The rationale for the estimator may easily be 
seen by referring back to the X’X matrices given in the numerical illustrations. 
Increasing the diagonal elements and leaving the off-diagonal elements unchanged 
may be expected to reverse the sequence of effects shown in those examples where 
the off-diagonal elements have been increased relative to the diagonal elements. It 
follows directly from Eq. (6-77) that 


$$E(b_R) = (X'X + cI)^{-1}X'X\beta \tag{6-78}$$
and 


$$\operatorname{var}(b_R) = \sigma^2(X'X + cI)^{-1}X'X(X'X + cI)^{-1} \tag{6-79}$$


The ridge estimator is thus biased, but it may be shown that the variances of the 
elements of $b_R$ are less than those of the OLS estimator.‡ This raises the 
possibility that a ridge estimator may have a smaller mean-square error (MSE) 
than the OLS estimator.§ The main difficulty centers on the selection of a 
numerical value for the arbitrary scalar c. In their original article Hoerl and 
Kennard suggested trying various values of c in an attempt to see if the $b_R$ vector 
stabilized. Schmidt, in the source cited, establishes conditions for c to minimize 
$E(b_R - \beta)'(b_R - \beta)$, the sum of the MSEs of the ridge estimators. These 
conditions, however, depend on unknown parameters. Using sample estimates of these 
parameters to determine c would yield an estimator with complicated and as yet 
unknown sampling properties, so that inferences about $\beta$ could not be made. The 
ridge technique essentially consists of an arbitrary numerical adjustment to the 
sample data, and one does not really know how to interpret the resultant 
estimators.¶ 
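The bias and variance expressions in Eqs. (6-78) and (6-79) can be evaluated for any chosen c. The Python sketch below does this for a made-up pair of nearly collinear regressors; the function names, the data, and the values of c are assumptions for illustration, and the true $\beta$ and $\sigma^2$ are supplied by us, which is exactly what is not possible in practice.

```python
import numpy as np

def ridge(X, y, c):
    """Ridge estimate b_R = (X'X + cI)^{-1} X'y of Eq. (6-77); c = 0 gives OLS."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + c * np.eye(k), X.T @ y)

def ridge_bias_and_variance(X, beta, sigma2, c):
    """Bias vector and variance matrix implied by Eqs. (6-78) and (6-79)."""
    k = X.shape[1]
    G = np.linalg.inv(X.T @ X + c * np.eye(k))
    bias = G @ X.T @ X @ beta - beta
    var = sigma2 * G @ X.T @ X @ G
    return bias, var

rng = np.random.default_rng(5)
n = 50
x2 = rng.normal(size=n)
X = np.column_stack([x2, x2 + 0.05 * rng.normal(size=n)])   # nearly collinear columns
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(size=n)

for c in (0.0, 0.1, 1.0):
    bias, var = ridge_bias_and_variance(X, beta, 1.0, c)
    print(c, ridge(X, y, c).round(3), bias.round(3), np.diag(var).round(3))
```

As c rises the variances fall and the bias grows, so the comparison of mean-square errors depends on the unknown $\beta$ and $\sigma^2$, which is the difficulty noted above.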


† See A. E. Hoerl and R. W. Kennard, “Ridge Regression: Biased Estimation for Non-Orthogonal 
Problems,” Technometrics, 1970, pp. 55-68, for an exposition of the theory; and A. E. Hoerl and 
R. W. Kennard, “Ridge Regression: Applications to Nonorthogonal Problems,” Technometrics, 1970, 
pp. 69-82, for two illustrations. 

‡ See P. Schmidt, Econometrics, Marcel Dekker, New York, 1976, pp. 48-55, for this result and a 
very useful discussion of the theory of ridge regression. 

§ For a definition of MSE see Chap. 2, pages 27-28. 

¶ For further discussion see the series of papers in Journal of the American Statistical Association, 
75, 1980, pp. 74-103. 

Another approach to improving the MSE involves the suggestion that one or 
more explanatory variables be dropped in order to improve the MSE of the 
remaining coefficients. To illustrate the approach consider just a three-variable 
model, 

$$y = \beta_2x_2 + \beta_3x_3 + u \tag{6-80}$$
where the variables have been expressed in deviation form.† Let us denote the 
coefficients resulting from the application of OLS to Eq. (6-80) as 


$b_{12.3}$ = OLS estimate of $\beta_2$ 
$b_{13.2}$ = OLS estimate of $\beta_3$ 


From the properties of the OLS model we know these estimators are unbiased 
and their sampling variances are‡ 

$$\operatorname{var}(b_{12.3}) = \frac{\sigma^2}{\sum x_2^2(1 - r_{23}^2)} \tag{6-81}$$
$$\operatorname{var}(b_{13.2}) = \frac{\sigma^2}{\sum x_3^2(1 - r_{23}^2)} \tag{6-82}$$


Clearly, as $r_{23}^2$ gets close to unity, both sampling variances increase dramatically. 
Now consider the simple regression of y on $x_2$ and denote the slope coefficient by 


$$b_{12} = \frac{\sum yx_2}{\sum x_2^2}$$
Substituting for y from Eq. (6-80) gives 
$$b_{12} = \beta_2 + b_{32}\beta_3 + \frac{\sum x_2u}{\sum x_2^2} \tag{6-83}$$
where 
$$b_{32} = \frac{\sum x_2x_3}{\sum x_2^2}$$


† Strictly speaking, when the relation is written in deviation form, the disturbance term is $u - \bar{u}$, 
but this slight complication has no effect on any of the derivations in which we are interested and so 
may be ignored. 

‡ These are derived from the general formula 


$$\operatorname{var}\begin{bmatrix} b_{12.3} \\ b_{13.2} \end{bmatrix} = \sigma^2(X'X)^{-1} = \frac{\sigma^2}{\sum x_2^2\sum x_3^2 - (\sum x_2x_3)^2}\begin{bmatrix} \sum x_3^2 & -\sum x_2x_3 \\ -\sum x_2x_3 & \sum x_2^2 \end{bmatrix}$$
which gives 
$$\operatorname{var}(b_{12.3}) = \frac{\sigma^2\sum x_3^2}{\sum x_2^2\sum x_3^2 - (\sum x_2x_3)^2} = \frac{\sigma^2}{\sum x_2^2(1 - r_{23}^2)}$$


where $r_{23}$ is the simple correlation coefficient between $x_2$ and $x_3$. A similar derivation yields the result 
for $\operatorname{var}(b_{13.2})$. 

From Eq. (6-83) it follows that 


$$E(b_{12}) = \beta_2 + b_{32}\beta_3 \tag{6-84}$$
and† 
$$\operatorname{var}(b_{12}) = \frac{\sigma^2}{\sum x_2^2} \tag{6-85}$$


Thus $b_{12}$ is a biased estimator of $\beta_2$, unless $x_2$ and $x_3$ are orthogonal so that 
$b_{32} = 0$. However, comparison of Eqs. (6-81) and (6-85) shows that $b_{12}$ has a 
smaller sampling variance than $b_{12.3}$. The possibility then exists of a tradeoff 
between bias and variance. The crucial question is under what conditions $b_{12}$ may 
have a smaller MSE than $b_{12.3}$. 
As shown in Eq. (2-23), 
MSE = sampling variance + square of bias 

Thus 


$$\operatorname{MSE}(b_{12}) = \frac{\sigma^2}{\sum x_2^2} + b_{32}^2\beta_3^2$$
and 
$$\operatorname{MSE}(b_{12.3}) = \frac{\sigma^2}{\sum x_2^2(1 - r_{23}^2)}$$


A little algebra then shows‡ 


$$\frac{\operatorname{MSE}(b_{12})}{\operatorname{MSE}(b_{12.3})} = 1 + r_{23}^2(\tau^2 - 1) \tag{6-86}$$
where 
$$\tau^2 = \frac{\beta_3^2}{\sigma^2/[\sum x_3^2(1 - r_{23}^2)]} = \frac{\beta_3^2}{\operatorname{var}(b_{13.2})} \tag{6-87}$$


This $\tau^2$ statistic is the ratio of the square of the true (but unknown) $\beta_3$ to the true 
(not the estimated) variance of $b_{13.2}$. From Eq. (6-86), if $\tau^2 < 1$, 


$$\operatorname{MSE}(b_{12}) < \operatorname{MSE}(b_{12.3})$$


Thus if one were mainly interested in obtaining as accurate an estimate as 
possible of $\beta_2$, and if one felt confident that $\tau^2$ was less than unity, it might seem 
sensible to drop $x_3$ from the regression and carry out a simple regression of y on 

† Notice that, in this case, $\operatorname{var}(b_{12})$ is the variance about a biased expectation. From first principles

$$\operatorname{var}(b_{12}) = E\{[b_{12} - E(b_{12})]^2\} = E\{[b_{12} - \beta_2 - b_{32}\beta_3]^2\} = E\left[\left(\frac{\sum x_2 u}{\sum x_2^2}\right)^2\right] = \frac{\sigma^2}{\sum x_2^2}$$

‡ See Problem 6-9.

$x_2$. The snag of course is that $\tau^2$ is unknown, and the consequences of mistakes
about its value could be fairly serious. If, for example, $\beta_3$ is three times its true
standard error and $r_{23}^2$ is around 0.8, then $\operatorname{MSE}(b_{12})$ will be over seven times as
large as $\operatorname{MSE}(b_{12.3})$.
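As a quick numerical check of Eq. (6-86), the following minimal Python sketch evaluates the MSE ratio for given values of $r_{23}^2$ and $\tau^2$. The function name `mse_ratio` and its argument names are illustrative choices, not notation from the text; the example values are those of the case just described, with $\beta_3$ equal to three true standard errors ($\tau^2 = 9$) and $r_{23}^2 = 0.8$.

```python
def mse_ratio(r23_sq: float, tau_sq: float) -> float:
    """Ratio MSE(b12) / MSE(b12.3) as given by Eq. (6-86)."""
    if not 0.0 <= r23_sq < 1.0:
        raise ValueError("r23_sq must lie in [0, 1)")
    return 1.0 + r23_sq * (tau_sq - 1.0)


# beta_3 equal to three true standard errors corresponds to tau^2 = 9.
ratio = mse_ratio(r23_sq=0.8, tau_sq=9.0)
print(round(ratio, 2))
```

With these inputs the ratio is 1 + 0.8(9 - 1) = 7.4, which is the "over seven times" quoted above; any $\tau^2$ below unity gives a ratio below one.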
In view of Eq. (6-86) it might seem plausible to drop $x_3$ from the regression if
the estimated t value is numerically less than 1, that is, if

$$\hat{\tau}^2 = \frac{b_{13.2}^2}{s^2 / \sum x_3^2\,(1 - r_{23}^2)} = F = t^2 < 1 \tag{6-88}$$

Thus one may define a conditional omitted variable (COV) estimator of $\beta_2$ as

$$b_{\mathrm{COV}} = \begin{cases} b_{12} & \text{if } F < 1 \\ b_{12.3} & \text{if } F \ge 1 \end{cases} \tag{6-89}$$

Other COV estimators may be defined using critical F values other than unity.
Feldstein has investigated the MSE of $b_{\mathrm{COV}}$ relative to $\operatorname{MSE}(b_{12.3})$ for various
values of $r_{23}$, various values of $\tau$, and also for several critical F values, including
unity and the conventional $F_{0.95}$.† When $|\tau| > 1$, sampling fluctuations can still
give F < 1, and consequently the COV estimators are inferior to OLS.


Feldstein's main conclusion is

OLS is preferable to any of the COV estimators unless the researcher has a
strong prior belief that $|\tau| < 1$.‡

Feldstein also investigates the properties of a weighted (WTD) estimator
which is simply a linear combination of $b_{12}$ and $b_{12.3}$. We define

$$b_{\mathrm{WTD}} = \lambda b_{12.3} + (1 - \lambda) b_{12} \tag{6-90}$$

It may be shown that the value of $\lambda$ which minimizes $\operatorname{MSE}(b_{\mathrm{WTD}})$ is§

$$\lambda = \frac{\tau^2}{1 + \tau^2} \tag{6-91}$$

This is the same unknown $\tau^2$ statistic already defined in Eq. (6-87). The WTD
estimator could be made operational by computing the $\hat{\tau}^2$ statistic defined in Eq.
(6-88), hence computing

$$\hat{\lambda} = \frac{\hat{\tau}^2}{1 + \hat{\tau}^2}$$

and substituting this value of $\lambda$ in Eq. (6-90). Feldstein's simulation experiments
show the WTD estimator to be generally superior to the various COV estimators
in his study, but to be inferior to OLS when $|\tau| > 1.5$. Thus exhaustive study of


† M. S. Feldstein, “Multicollinearity and the Mean Square Error of Alternative Estimators,”
Econometrica, 41, 1973, pp. 337-346. See especially Tables I, II, and III.

‡ M. S. Feldstein, op. cit., p. 344.
§ See M. S. Feldstein, op. cit., or work it out directly in Problem 6-10.

the three-variable case suggests that, even in the presence of high correlation
between $x_2$ and $x_3$, the best procedure is probably the straightforward OLS
regression of y on $x_2$ and $x_3$. Only if the investigator has really strong prior beliefs
that $\beta_3$ is less than $\sqrt{\operatorname{var}(b_{13.2})}$ should $x_3$ be dropped from the regression. Even
this nonstartling advice to drop a variable when you are fairly sure its coefficient
is “small” is only helpful if the investigator is mainly interested in the other
coefficient, $\beta_2$.
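The COV and WTD estimators of Eqs. (6-89) and (6-90) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than Feldstein's own procedure: the data are taken to be already in deviation form, the residual degrees of freedom are assumed to be n - 3 (an intercept having been removed by taking deviations), and the function names, dictionary keys, and sample data are hypothetical.

```python
def cross(a, b):
    """Sum of cross products of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def cov_and_wtd(y, x2, x3, f_crit=1.0):
    """COV and WTD estimates of beta_2 from deviation-form data.

    Assumes an intercept was removed by taking deviations, so the
    residual degrees of freedom are taken to be n - 3.
    """
    n = len(y)
    s22, s33, s23 = cross(x2, x2), cross(x3, x3), cross(x2, x3)
    s2y, s3y = cross(x2, y), cross(x3, y)
    det = s22 * s33 - s23 ** 2

    b12 = s2y / s22                            # simple regression of y on x2
    b12_3 = (s33 * s2y - s23 * s3y) / det      # OLS estimate of beta_2
    b13_2 = (s22 * s3y - s23 * s2y) / det      # OLS estimate of beta_3

    rss = cross(y, y) - b12_3 * s2y - b13_2 * s3y
    s_sq = rss / (n - 3)
    f_stat = b13_2 ** 2 / (s_sq * s22 / det)   # tau-hat squared, Eq. (6-88)

    b_cov = b12 if f_stat < f_crit else b12_3  # Eq. (6-89)
    lam = f_stat / (1.0 + f_stat)              # estimated weight, Eq. (6-91)
    b_wtd = lam * b12_3 + (1.0 - lam) * b12    # Eq. (6-90)
    return {"b12": b12, "b12_3": b12_3, "F": f_stat,
            "b_cov": b_cov, "b_wtd": b_wtd}


# Hypothetical deviation-form data (each series sums to zero).
x2 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x3 = [-1.0, -2.0, 0.0, 2.0, 1.0]
y = [-3.0, -2.0, 1.0, 2.0, 2.0]
result = cov_and_wtd(y, x2, x3)
print(result)
```

For any data set the WTD estimate lies between $b_{12}$ and $b_{12.3}$, moving toward the OLS value $b_{12.3}$ as the sample F statistic grows.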

Even though these results on the three-variable case are not very helpful, 
considerable work has been done on extensions of the approach to the k-variable 
case. As we have already seen in Chap. 5, setting a coefficient or group of 
coefficients at zero is a special case of imposing a set of linear restrictions on the 
coefficients. Thus the question arises whether the imposition of a set of restric- 
tions will result in estimators which are better in some MSE sense than the 
unrestricted OLS estimators, even though the restrictions may not, in fact, be 
true.

The first problem is the generalization of the MSE criterion to a number of
estimators. Consider the usual linear model

$$y = X\beta + u$$

with the set of q (< k) restrictions embodied in

$$R\beta = r$$

As seen in Eq. (6-5), the estimator embodying these restrictions is

$$b_* = b + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - Rb)$$

where

$$b = (X'X)^{-1}X'y$$

is the unrestricted OLS estimator. We may define the MSE matrix for $b_*$ as

$$\operatorname{MSE}(b_*) = E\{(b_* - \beta)(b_* - \beta)'\} \tag{6-92}$$

This is a symmetric k × k matrix with the MSEs of the individual coefficients
displayed on the principal diagonal. The typical off-diagonal term is

$$E(b_{*i} - \beta_i)(b_{*j} - \beta_j) \qquad i \ne j$$

which is essentially a covariance defined in terms of the true $\beta_i$, $\beta_j$ values rather
than in terms of the expected values of the estimators. One might then say that $b_*$
is better in MSE than b if

$$c'\operatorname{MSE}(b_*)\,c \le c'\operatorname{MSE}(b)\,c \tag{6-93}$$

for any nonnull k-element vector c.† This is a very strong criterion, requiring that


† Notice that Eq. (6-93) is equivalent to the condition $\operatorname{MSE}(c'b_*) \le \operatorname{MSE}(c'b)$, for

$$\operatorname{MSE}(c'b_*) = E\{[c'(b_* - \beta)]^2\} = E\{c'(b_* - \beta)(b_* - \beta)'c\} = c'\operatorname{MSE}(b_*)\,c$$

and similarly for $\operatorname{MSE}(c'b)$. Thus Eq. (6-93) requires that the MSE of any linear combination of the
elements of $b_*$ be no greater than the MSE of the same linear combination of the elements of b.


FURTHER TOPICS IN THE k-VARIABLE LINEAR MODEL 257 


any quadratic form in $\operatorname{MSE}(b_*)$ be less than or equal to the corresponding
quadratic form in $\operatorname{MSE}(b)$. A much weaker criterion would be

$$\operatorname{tr}\operatorname{MSE}(b_*) \le \operatorname{tr}\operatorname{MSE}(b) \tag{6-94}$$

that is, that the sum of the MSEs of the restricted estimators be less than or equal
to the sum of the MSEs of the unrestricted estimators.

The problem of determining the conditions under which Eq. (6-93) or Eq.
(6-94) might hold has been investigated in a number of papers by Wallace,
Toro-Vizcarrondo, and Goodnight.† It can be shown that the restricted estimators
(even when the restrictions are incorrect) will have smaller variances than the
unrestricted OLS estimators. However, taking expectations of Eq. (6-5) shows
that

$$E(b_*) = \beta + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - R\beta)$$

so that $b_*$ will be a biased estimator if the restrictions are not correct. This is a
generalization of the tradeoff between bias and variance in the previous simple
example. Wallace and Toro-Vizcarrondo show that the strong MSE criterion
(6-93) will be satisfied iff‡

$$\lambda = \frac{1}{2\sigma^2}(r - R\beta)'[R(X'X)^{-1}R']^{-1}(r - R\beta) \le \frac{1}{2} \tag{6-95}$$

As in the previous simple case, this condition involves the true but unknown $\beta$
vector and the unknown $\sigma^2$. If these are replaced by their OLS estimators and the
resultant value of the statistic in Eq. (6-95) is denoted by $\hat{\lambda}$, it is easy to see that

$$\frac{2\hat{\lambda}}{q} = F$$


† C. Toro-Vizcarrondo and T. D. Wallace, “A Test of the Mean Square Error Criterion for
Restrictions in Linear Regression,” Journal of the American Statistical Association, 1968, pp. 558-572;
T. D. Wallace and C. E. Toro-Vizcarrondo, “Tables for the Mean Square Error Test for Exact Linear
Restrictions in Regression,” Journal of the American Statistical Association, 1969, pp. 1649-1663;
T. D. Wallace, “Weaker Criteria and Tests for Linear Restrictions in Regression,” Econometrica, 40,
1972, pp. 689-698; J. Goodnight and T. D. Wallace, “Operational Techniques and Tables for Making
Weak MSE Tests for Restrictions in Regressions,” Econometrica, 40, 1972, pp. 699-709.

‡ Notice that dropping $x_3$ from the model

$$y = \beta_2 x_2 + \beta_3 x_3 + u$$

is equivalent to imposing the restriction

$$\begin{bmatrix} 0 & 1 \end{bmatrix}\begin{bmatrix} \beta_2 \\ \beta_3 \end{bmatrix} = 0$$

and with these specifications of r and R condition (6-95) becomes

$$\frac{\beta_3^2}{\operatorname{var}(b_{13.2})} \le 1$$

or $\tau^2 \le 1$ as derived in Eq. (6-87).
where

$$F = \frac{(e_*'e_* - e'e)/q}{e'e/(n - k)}$$

is the sample statistic, defined in Eq. (6-8), for testing the null hypothesis

$$H_0\colon R\beta = r$$

When $H_0$ is true, $\lambda = 0$, and F has the central F distribution with q, n - k degrees
of freedom. The test of $H_0$ is made, as we have seen, by comparing the sample F
with a preselected critical value from the central F distribution. The basic result
of Toro-Vizcarrondo and Wallace (1968) is that when $H_0$ is not true, the F
statistic, defined above, follows the noncentral F distribution with degrees of
freedom q, n - k and noncentrality parameter $\lambda$, defined in Eq. (6-95). Thus the
test for the improvement in MSE is to compare the sample F statistic with a
critical value from the noncentral F distribution with $\lambda = 0.5$. Critical points of this
distribution are tabulated in Wallace and Toro-Vizcarrondo (1969). The practical
procedure is as follows:


1. Compute the usual F statistic, based on the difference in the residual sums of
squares from the restricted and unrestricted regressions.

2. If $F > F(q, n - k)_{0.95}$, say, in the table by Wallace and Toro-Vizcarrondo,
reject the hypothesis that the restricted estimators are better in MSE. If the
sample F is less than the critical value, use the restricted estimators.


The above procedure is for the strong MSE criterion, embodied in Eq. (6-93).
Wallace (1972) has shown that the weaker MSE criterion (6-94) will be satisfied if

$$\lambda \le \frac{q}{2}$$

where q is the number of restrictions. The appropriate critical values of F are
tabulated in Goodnight and Wallace (1972). As an indication of how these
procedures would work, consider the following critical F values, taken from the
appropriate tables:

                          Noncentrality parameter
                      λ = 0      λ = 0.5      λ = q/2

F(3, 20) 0.95          3.10        4.06         5.73

If one were testing $H_0\colon R\beta = r$ at the 5 percent level with q = 3 and
n - k = 20, $H_0$ would be rejected if the sample F exceeded 3.10, and one would
conclude that the restrictions were not true. However, a sample F as high as 4.06
in the case of the strong MSE criterion, and as high as 5.73 in the case of the
weak MSE criterion, would still lead to the imposition of the restrictions and the
use of the restricted estimator on the Wallace criterion.
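The decision rule just described can be written out as a small Python sketch. The function name, the dictionary keys, and the treatment of a sample F exactly equal to a critical value (here counted as still favoring the restricted estimator) are assumptions made for illustration; the three critical values are those quoted above for F(3, 20) at the 0.95 point.

```python
def mse_test_decision(f_stat, crit_central, crit_strong, crit_weak):
    """Compare a sample F with central and noncentral critical values.

    crit_central: critical value with noncentrality 0 (usual test of H0)
    crit_strong:  critical value with noncentrality 0.5 (strong MSE criterion)
    crit_weak:    critical value with noncentrality q/2 (weak MSE criterion)
    """
    return {
        "reject_H0": f_stat > crit_central,
        "use_restricted_strong": f_stat <= crit_strong,
        "use_restricted_weak": f_stat <= crit_weak,
    }


# Critical values quoted in the text for q = 3 and n - k = 20.
CRIT = {"central": 3.10, "strong": 4.06, "weak": 5.73}

decision = mse_test_decision(3.5, CRIT["central"], CRIT["strong"], CRIT["weak"])
print(decision)
```

A sample F of 3.5 therefore rejects the restrictions as untrue at the 5 percent level while still pointing to the restricted estimator under both MSE criteria, which is exactly the tension discussed in the text.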
These procedures, as in the COV estimator of Eq. (6-89), rest on a prior
significance test. Their actual performance in repeated applications would need to 
be evaluated as Feldstein did for the COV estimator in the three-variable case. It 
is also doubtful whether, in practice, econometricians would wish to impose 
restrictions, which seem unlikely to be true, in order to improve the estimators in 
the MSE sense. For example, suppose the estimation of a production function 
leads to the rejection of the hypothesis of constant returns to scale, but the sample 
F value does not exceed the critical value for the weak MSE condition. The 
restricted estimates of capital and labor elasticities may have lower MSEs than 
the unrestricted estimates, but they will sum to unity and incorrectly indicate 
constant returns to scale. The investigator must make a value judgment as to 
whether this kind of tradeoff is desirable. 

The upshot of this discussion of multicollinearity is not very comforting. 
Some data sets contain very little information and do not enable one to disentan- 
gle the relative effects of variables with much precision. The multiple correlation 
coefficients among the explanatory variables will indicate those coefficients which 
are likely to be most adversely affected by the collinearity, and one should not 
readily drop these variables from a regression because of low t statistics. Restricted
estimators may have greater precision, but at the cost of a bias. Only the
accumulation of more and better data sets will yield more precise estimates of 
complex interrelationships. 


6-6 SPECIFICATION ERROR 


Strictly speaking the term specification error covers any mistake in the set of 
assumptions underpinning a model and the associated inference procedures, but it 
has come to be used particularly for errors in specifying the data matrix X.†
There are two problems involved in specifying X. The first is knowing which 
variables (such as income, relative prices, etc.) to include, and the second is in 
what mathematical form each variable is to be included. So far we have blithely 
assumed such knowledge to be readily available. In practice it is not. Economic 
theory can normally indicate the set of explanatory variables corresponding to 
any assumed model (utility maximization, cost minimization, etc.), but theory 
cannot usually indicate the precise form of the relationship. In less favorable 


situations where there is no clearly articulated theory there may be no clear guide 
to relevant explanatory variables. On top of all this one may not be able to obtain 
measurements on appropriate variables and, hence, have to use proxy variables in 
their place. 
To establish the effects of misspecification of X, let us suppose that the true 
model is 
$$y = X\beta + u \tag{6-96}$$


+See H. Theil, “Specification Errors and the Estimation of Economic Relationships,” Review of 
the International Statistical Institute, 25, 1957, pp. 41-51. 
with

$$E(u) = 0 \quad \text{and} \quad E(uu') = \sigma^2 I$$

The model specified by the investigator is

$$y = X_*\beta_* + u_* \tag{6-97}$$

where, of course, some variables may be common to both X and $X_*$. The
investigator thus computes the estimated coefficient vector

$$b_* = (X_*'X_*)^{-1}X_*'y$$

Substituting for y from Eq. (6-96) gives

$$b_* = (X_*'X_*)^{-1}X_*'X\beta + (X_*'X_*)^{-1}X_*'u$$

Thus

$$E(b_*) = (X_*'X_*)^{-1}X_*'X\beta \tag{6-98}$$

and the expectations of the estimated coefficients are seen to be not the true
population parameters but rather linear combinations of those parameters. We
may distinguish a number of different possibilities.


Case 6-1: Exclusion of relevant variables Suppose that the $X_*$ and X matrices are

$$X_* = [x_1 \;\; x_2 \;\; \cdots \;\; x_r] = X_1$$
$$X = [x_1 \;\; x_2 \;\; \cdots \;\; x_r \;\; x_{r+1} \;\; \cdots \;\; x_k] = [X_1 \;\; X_2]$$

The investigator has correctly included the first r explanatory variables but
mistakenly omitted the remaining k - r variables. It follows directly that

$$(X_*'X_*)^{-1}X_*'X = (X_1'X_1)^{-1}[X_1'X_1 \;\; X_1'X_2] = [I \;\; (X_1'X_1)^{-1}X_1'X_2]$$

Thus

$$E(b_{*i}) = \beta_i + a_{i,r+1}\beta_{r+1} + \cdots + a_{i,k}\beta_k \qquad i = 1, \ldots, r$$

where $a_{i,r+1}, \ldots, a_{i,k}$ are the elements in the ith row of $(X_1'X_1)^{-1}X_1'X_2$. The
columns of this last matrix are seen to be the OLS coefficients obtained when each
excluded variable in turn is regressed on the set of included variables. Thus even
though the investigator has managed to include a number of the true explanatory
variables, their coefficients will be biased, and the bias is seen to be some linear
combination of the true coefficients of the excluded variables. This of course
destroys the conventional b.l.u.e. property of OLS estimators. The conventional
inference procedures are also undermined, not only because of Eq. (6-98), but
also because the disturbance variance cannot be correctly estimated. When y is
regressed on $X_* = X_1 = [x_1 \;\; x_2 \;\; \cdots \;\; x_r]$, the residual vector is $M_1y$, where

$$M_1 = I - X_1(X_1'X_1)^{-1}X_1'$$

is a symmetric idempotent matrix of rank and trace equal to n - r. The residual
sum of squares is

$$\mathrm{RSS} = y'M_1y$$

Writing Eq. (6-96) in partitioned form as

$$y = X_1\beta_1 + X_2\beta_2 + u$$

and substituting in RSS gives

$$\mathrm{RSS} = (X_2\beta_2 + u)'M_1(X_2\beta_2 + u)$$

since $M_1X_1 = 0$, and so

$$\mathrm{RSS} = u'M_1u + \beta_2'X_2'M_1X_2\beta_2 + 2\beta_2'X_2'M_1u$$

Thus

$$E(\mathrm{RSS}) = E(u'M_1u) + \beta_2'X_2'M_1X_2\beta_2 = \sigma^2(n - r) + \beta_2'X_2'M_1X_2\beta_2$$

and so

$$E\left(\frac{\mathrm{RSS}}{n - r}\right) = \sigma^2 + \frac{1}{n - r}\,\beta_2'X_2'M_1X_2\beta_2 \tag{6-99}$$

The matrix of the quadratic form in Eq. (6-99) is the matrix containing the sums
of squares and the cross products of the residual vectors obtained when each
excluded variable in $X_2$ is regressed on the set of included variables $X_1$. Apart
from a constant divisor it is a variance-covariance matrix and thus positive
semidefinite, so Eq. (6-99) establishes that the residual variance estimated from
the specified regression of y on $X_1$ will, on average, overestimate the true
disturbance variance. As in Eq. (6-98), the bias involves the true but unknown
coefficients of the excluded variables. The bias in the regression coefficients would
disappear if the included and excluded variables were orthogonal, $X_1'X_2 = 0$, but
the estimated disturbance variance would have expectation

$$E\left(\frac{\mathrm{RSS}}{n - r}\right) = \sigma^2 + \frac{1}{n - r}\,\beta_2'X_2'X_2\beta_2 > \sigma^2$$

so that faulty inferences would still be made.
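The simplest instance of Eq. (6-98) under Case 6-1 has one included and one excluded regressor, where the bias in the included coefficient is $b_{32}\beta_3$. The Python sketch below illustrates this with hypothetical deviation-form data and hypothetical coefficient values; as a simplifying assumption the disturbance is set to zero, so the short-regression slope equals its expectation exactly and no simulation is needed.

```python
def slope(x, y):
    """OLS slope for deviation-form data: sum(x*y) / sum(x*x)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical true coefficients and deviation-form regressors.
beta2, beta3 = 1.0, 0.5
x2 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x3 = [-1.0, -2.0, 0.0, 2.0, 1.0]

# Disturbance set to zero so that the estimate equals its expectation.
y = [beta2 * a + beta3 * b for a, b in zip(x2, x3)]

b32 = slope(x2, x3)      # auxiliary regression of the excluded x3 on x2
b_short = slope(x2, y)   # regression of y on x2 alone, wrongly omitting x3

print(b32, b_short, beta2 + b32 * beta3)
```

The short-regression slope coincides with $\beta_2 + b_{32}\beta_3$, so the bias vanishes only when $\beta_3 = 0$ or when the two regressors are orthogonal.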


Case 6-2: Inclusion of irrelevant variables The $X_*$ and X matrices could now be
specified as

$$X_* = [X_1 \;\; X_2]$$
$$X = [X_1]$$

where $X_1$ is n × k and $X_2$ (the matrix of irrelevant variables) is n × s. When each
true variable in $X_1$ is regressed on $[X_1 \;\; X_2]$, the least-squares fit will force the
coefficient of that same variable on the right-hand side to unity and all other
coefficients to zero. Thus

$$(X_*'X_*)^{-1}X_*'X = \begin{bmatrix} I_k \\ 0 \end{bmatrix}$$

where 0 is a null matrix of order s × k. Thus the coefficients of the variables in $X_1$
will be unbiased estimates of the true parameters, and the coefficients of the
variables in $X_2$ will have zero expectations. The residual variance will also be an
unbiased estimate of $\sigma^2$. The residual sum of squares from the regression of y on
$X_*$ is

$$\mathrm{RSS} = y'My$$

where

$$M = I - X_*(X_*'X_*)^{-1}X_*'$$

Since the true model is, by assumption,

$$y = X_1\beta_1 + u$$

$$\mathrm{RSS} = (X_1\beta_1 + u)'M(X_1\beta_1 + u) = u'Mu + 2\beta_1'X_1'Mu + \beta_1'X_1'MX_1\beta_1 = u'Mu$$

since $MX_1 = 0$ as $MX_* = M[X_1 \;\; X_2] = 0$. Thus

$$E(\mathrm{RSS}) = \sigma^2 \operatorname{tr} M = (n - k - s)\sigma^2$$

and so

$$E(s^2) = \sigma^2$$

where

$$s^2 = \frac{\mathrm{RSS}}{n - k - s}$$

Notice that although the true model only contains k variables, the correct divisor
in $s^2$ is n - k - s, where k + s is the number of variables actually included in the
misspecified model.


It would seem from the discussion of these two cases that it is more serious to 
omit relevant variables than to include irrelevant variables since in the former 
case the coefficients will be biased, the disturbance variance overestimated, and 
conventional inference procedures rendered invalid, while in the latter case the 
coefficients will be unbiased, the disturbance variance properly estimated, and the 
inference procedures will be valid. This constitutes a fairly strong case for 
including rather than excluding variables from a regression equation. There is,
however, a qualification to this view. Adding extra variables, be they relevant or
irrelevant, will lower the precision of estimation of the relevant coefficients. This
point has already been illustrated for a simple model in the previous section on
multicollinearity. Suppose the true model is

$$y = \beta_2 x_2 + u \tag{6-100}$$

and the assumed model is

$$y = \beta_2 x_2 + \beta_3 x_3 + u \tag{6-101}$$




The sampling variance of the estimate of $\beta_2$ obtained by applying OLS to Eq.
(6-101) is, as shown in Eq. (6-81),

$$\frac{\sigma^2}{\sum x_2^2\,(1 - r_{23}^2)}$$

whereas the correct sampling variance, under Eq. (6-100), is $\sigma^2 / \sum x_2^2$. More
generally if $X_1$ indicates the set of true explanatory variables, $X_2$ the set of
irrelevant variables, and if y were regressed just on $X_1$, the variance matrix for the
estimated coefficients would be

$$\operatorname{var}(b_1) = \sigma^2(X_1'X_1)^{-1} \tag{6-102}$$

When y is regressed on $[X_1 \;\; X_2]$, the variance matrix for the coefficients of the
variables in $X_1$ is

$$\sigma^2\left(X_1'X_1 - X_1'X_2(X_2'X_2)^{-1}X_2'X_1\right)^{-1} \tag{6-103}$$

The diagonal elements in Eq. (6-102) will be smaller than those in Eq. (6-103). In
the event of substantial collinearity this drop in precision may be serious, but
subject to this qualification, including irrelevant variables would seem a less
serious problem than the exclusion of possibly relevant variables. Riddell and
Buse derive all the main results for Cases 6-1 and 6-2 in a unified fashion by
treating them as special cases of restricted least squares.†
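The loss of precision from adding an irrelevant regressor can be quantified in the two-regressor case as the ratio of the variance in Eq. (6-81) to $\sigma^2 / \sum x_2^2$, that is, $1/(1 - r_{23}^2)$. The Python sketch below computes this factor from deviation-form data; the function name and the sample data are hypothetical and serve only as an illustration.

```python
def variance_inflation(x2, x3):
    """Factor by which var(estimate of beta_2) rises when x3 is added.

    Both series are assumed to be in deviation form. The factor is
    1 / (1 - r23^2), where r23 is the simple correlation of x2 and x3.
    """
    s22 = sum(a * a for a in x2)
    s33 = sum(b * b for b in x3)
    s23 = sum(a * b for a, b in zip(x2, x3))
    r_sq = s23 ** 2 / (s22 * s33)
    return 1.0 / (1.0 - r_sq)


# Hypothetical deviation-form regressors with r23 = 0.8.
x2 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x3 = [-1.0, -2.0, 0.0, 2.0, 1.0]
factor = variance_inflation(x2, x3)
print(round(factor, 3))
```

With a correlation of 0.8 between the two regressors the sampling variance is inflated by a factor of 1/0.36, so the standard error rises by about two-thirds even though the added variable has a true coefficient of zero.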


Case 6-3: The general case The general case relates to the mistaken use of the $X_*$
matrix instead of the X matrix, as specified in Eqs. (6-96) and (6-97). The residual
sum of squares from the regression of y on $X_*$ is

$$e'e = y'M_*y$$

where

$$M_* = I - X_*(X_*'X_*)^{-1}X_*'$$

Substituting for y from Eq. (6-96) gives

$$e'e = (X\beta + u)'M_*(X\beta + u) = u'M_*u + \beta'X'M_*X\beta + 2\beta'X'M_*u$$

and

$$E\left(\frac{e'e}{n - k_*}\right) = \sigma^2 + \frac{1}{n - k_*}\,(\beta'X'M_*X\beta) \tag{6-104}$$

where $k_*$ is the number of columns in $X_*$. If no specification error were made, the $M_*$ matrix would become

$$M = I - X(X'X)^{-1}X'$$

and the quadratic form on the right-hand side of Eq. (6-104) would vanish. For
any specification error at all the matrix of the quadratic form, being essentially 


† See W. C. Riddell and A. Buse, “An Alternative Approach to Specification Errors,” Australian
Economic Papers, 19, 1980, pp. 211-214.

the variance-covariance matrix computed from the residual vectors obtained when
each variable in X is regressed on $X_*$, is positive semidefinite. Thus the expected
value of the residual variance computed from the regression of y on $X_*$ will
exceed $\sigma^2$ and would only fall to $\sigma^2$ when $X_* = X$. This provides a rationalization
for the common practice of searching among regressions to find the minimum
residual sum of squares (or maximum $R^2$), though, of course, in any specific
application sampling fluctuations might yield a lower residual sum of squares for
$X_*$ than for X.

The result obtained in Eq. (6-98) that specification error leads to biased 
estimates of the population parameters must be interpreted with care. Suppose, 
for example, that y indicates observations on the rate of inflation, X the set of
explanatory variables in a “fiscalist” theory of inflation, and $X_*$ the set of
explanatory variables in a “monetarist” theory. A fiscalist will estimate Eq. (6-96)
and a monetarist Eq. (6-97). Monetarists will have little interest in the “news”
that their monetary coefficients are biased estimates of the coefficients of fiscalist
variables, nor would fiscalists be interested in the reverse information. Even if one
model really is the “true” model, the substantial correlation existing among
economic data may well help the “wrong” theory to put up a reasonably good
statistical showing. We are touching on the very difficult problem of the choice
between alternative models, which we will discuss in some more detail in Chap.
12.


PROBLEMS 


6-1 A data matrix of full column rank is partitioned as

$$X = [X_1 \;\; X_2]$$

where $X_1$ is $n \times k_1$ and $X_2$ is $n \times k_2$. Show that the upper left-hand block in $(X'X)^{-1}$ may be
expressed as

$$(X_1'M_2X_1)^{-1}$$

where

$$M_2 = I - X_2(X_2'X_2)^{-1}X_2'$$

Give a least-squares interpretation of $M_2X_1$ and hence of $X_1'M_2X_1$.


6-2 The following estimated equation was obtained by OLS regression using quarterly data for 1958 
to 1976 inclusive: 


$$\hat{y} = 2.20 + 0.104\,x_1 - 3.48\,x_2 + 0.34\,x_3$$
      (3.4)   (0.005)     (22)      (0.15)


Standard errors are in parentheses, the explained sum of squares was 109.6, and the residual sum of 
squares 18.48. 

(a) Test the significance of each of the slope coefficients. 

(b) Calculate the coefficient of determination R?. 

(c) When three seasonal dummy variables were added and the equation was reestimated, the 
explained sum of squares rose to 114.8. Test for the presence of seasonality. 

(d) Two further regressions, based on the original specification, were computed for the subperiods
1958, quarter 1, to 1968, quarter 4; and 1969, quarter 1, to 1976, quarter 4, yielding residual sums
of squares of 9.32 and 7.46, respectively. Test the following hypotheses:
(i) The error variances are identical in the two subperiods.
(ii) The coefficients are identical in the two subperiods.
(UL, 1981)


6-3 The following regression was estimated from 16 quarterly observations (t ratios in parentheses):

$$\hat{Y}_t = 70.7 - 0.90X_t + 0.43S_{1t} + 6.55S_{2t} - 2.83S_{3t}, \qquad R^2 = 0.68$$
      (3.7)  (0.27)   (3.37)     (3.40)     (3.37)

where $S_{it} = 1$ in the ith quarter and 0 otherwise. Explain the implied pattern of seasonal variation and
interpret the result.
(UL, 1980)

6-4 A production function model is specified as

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$$

where $Y_i$ = log output, $X_{2i}$ = log labor input, and $X_{3i}$ = log capital input. The data refer to a sample
of 23 firms, and observations are measured as deviations from the sample means

$$\sum x_{2i}^2 = 12 \qquad \sum x_{2i}x_{3i} = 8$$
$$\sum x_{3i}^2 = 12 \qquad \sum y_i x_{2i} = 10$$
$$\sum y_i^2 = 10 \qquad \sum y_i x_{3i} = 8$$

(a) Estimate $\beta_2$, $\beta_3$, their standard errors, and $R^2$.
(b) Test the hypothesis that $\beta_2 + \beta_3 = 1$.
(c) Suppose now that you wish to impose the a priori restriction that $\beta_2 + \beta_3 = 1$. What is the
least-squares estimate of $\beta_2$ and its standard error? What is the value of $R^2$ in this case? Compare
these results with those obtained in (a) and comment.
(UL, 1979)


6-5 A set of cross-section data on family income y and expenditure c is partitioned into subsets of 
observations, relating to families headed by: 


1. Manual workers 
2. Salaried workers 
3. Self-employed 


A regression of log c on log y is computed for each subsample and for the full sample, yielding:

                       β̂        s²       T

Manual workers        1.02     0.24     102
                     (0.06)

Salaried workers      0.91     0.46     104
                     (0.1)

Self-employed         0.76     0.30      26
                     (0.08)

All families          0.86     0.39     232
                     (0.05)

Here β̂ is the slope coefficient (standard errors in parentheses), s² is the residual variance, and T is the
sample size.

Test the hypotheses that:

(a) The elasticity of c with respect to y is the same for all occupational classes.

(b) Its value is unity.


Interpret your results and give some possible explanations for the observed differences. 
(UL, 1979) 
6-6 On the assumption that the elements of $\beta$ obey the restrictions

$$R\beta = r$$

show that the variance-covariance matrix of the restricted estimator $b_*$, defined in Eq. (6-5), is

$$\operatorname{var}(b_*) = \sigma^2\{(X'X)^{-1} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}\}$$

6-7 The model

$$Y = \alpha_1 + \alpha_2 E_2 + \alpha_3 E_3 + u$$

is estimated by OLS, where $E_2$ and $E_3$ are dummy variables indicating membership of the second and
third educational classes, respectively. Show that the OLS estimates are

$$\begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \hat{\alpha}_3 \end{bmatrix} = \begin{bmatrix} \bar{Y}_1 \\ \bar{Y}_2 - \bar{Y}_1 \\ \bar{Y}_3 - \bar{Y}_1 \end{bmatrix}$$

where $\bar{Y}_i$ denotes the mean value of Y in the ith educational class.

6-8 Rework the estimation problem based on the data in Table 6-6, using any other cell as the starting 
position and confirm that one obtains the same numerical estimates of the expected number of hours 
as those given in Table 6-7. 


6-9 Prove the result on MSEs stated in Eq. (6-86). 
6-10 Derive the result given in Eq. (6-91). 
6-11 The set of restrictions $R\beta = r$, with appropriate partitions of R and $\beta$, may be reformulated as

$$R_1\beta_1 + R_2\beta_2 = r$$

where $R_1$ is q × q and nonsingular and $R_2$ is q × (k - q). Show that the restricted estimator $b_*$,
defined in Eq. (6-5), may be obtained in two stages, namely:

(a) Regress the vector $(y - X_1R_1^{-1}r)$ on the matrix $(X_2 - X_1R_1^{-1}R_2)$ to obtain an estimate $b_{*2}$ of
$\beta_2$.

(b) Substitute this estimate in

$$\beta_1 = R_1^{-1}(r - R_2\beta_2)$$

to obtain an estimate of $\beta_1$.


CHAPTER SEVEN


MAXIMUM LIKELIHOOD ESTIMATORS AND 
ASYMPTOTIC DISTRIBUTIONS 


7-1 REVIEW AND PREVIEW 


Chaps. 5 and 6 have set out the main features of the k-variable linear model. It is 
very important to emphasize that the results obtained so far depend upon the 
particular set of assumptions made in specifying the model. It will be helpful to 
review those assumptions and results very briefly as this sets the stage for the 
remainder of the book, which is concerned with the many problems that arise in 
econometrics when various assumptions underpinning the simple model that we 
have considered so far have to be revised and extended. 
The k-variable linear model with n sample observations was specified as

$$\underset{(n \times 1)}{y} = \underset{(n \times k)}{X}\;\underset{(k \times 1)}{\beta} + \underset{(n \times 1)}{u}$$

with two crucial sets of assumptions, namely, assumptions about the X matrix
and assumptions about the disturbance vector u, that is,

1. X is of full column rank and nonstochastic

2. u has the properties $E(u) = 0$ and $\operatorname{var}(u) = \sigma^2 I$
or

3. $u \sim N(0, \sigma^2 I)$
The combination of assumptions 1 and 2 yields the result that the OLS
estimator $b = (X'X)^{-1}X'y$ with $\operatorname{var}(b) = \sigma^2(X'X)^{-1}$ is a best linear unbiased
estimator of B. The development of inference procedures required an assumption 
about the form of the distribution of the disturbance term, and the combination 
of assumptions 1 and 3 resulted in a comprehensive set of exact, finite sample 
inference procedures—tests of coefficients, confidence intervals, analysis of vari- 
ance procedures, tests of structural change, and so forth. 

The above assumptions are very restrictive, and parts of Chap. 6 examined 
some issues relating to the X matrix. Sections 6-3 and 6-4 indicated various 
applications resulting from the incorporation of dummy variables among the X's.
Section 6-5 examined the problems that arise when the X variables are highly 
correlated, and Sec. 6-6 discussed the problems involved in specifying the X 
matrix, that is, in knowing which variables in what functional form should 
comprise the columns of X. None of these issues violates the basic assumption 
that X was nonstochastic. It is, however, very important to relax this assumption. 
Also important is the relaxation of assumptions about the disturbance term. We 
will see in Chap. 8 that many real-world situations would preclude var(u) from 
having the extremely simple form set out in assumption 2, and it is important to 
develop appropriate estimators for these more complicated situations. We also
need to ask what are the effects of removing the normality assumption for the 
disturbance term. Finally we note that when X is nonstochastic, there is no 
question of any statistical dependence between the X’s and the u’s, but when the 
nonstochastic assumption is removed, this now becomes a possibility to be 
investigated, and, in fact, this particular problem has generated some of the major 
developments in econometric theory. 

In tackling this broader range of complex problems, the least-squares princi- 
ple alone cannot always yield an appropriate estimator. We need, therefore, to 
introduce the powerful maximum likelihood principle. Furthermore, in many of 
the new problems it proves excessively laborious and often impossible to derive 
exact finite sample results, but it is possible to derive results which hold in the 
limit, or asymptotically, as the sample size becomes infinitely large. Thus we need 
a simple introduction to asymptotic theory, and this is attempted in Sec. 7-2,
followed by an introduction to maximum likelihood estimators in Sec. 7-3. 


7-2 SOME REMARKS ON ASYMPTOTIC THEORY 


Asymptotic theory is concerned with the behavior of random variables as the
sample size tends to infinity. To illustrate, let $\bar{x}_n$ denote the mean of a random
sample of n observations drawn from some population of x values. Or let $b_n$
indicate the estimated slope of an OLS regression based on n pairs of sample
observations. Both $\bar{x}_n$ and $b_n$ are random variables with probability density
functions (pdf) denoted, say, by $f(\bar{x}_n \mid \mu, \sigma^2)$ and $f(b_n \mid \alpha, \beta, \sigma^2)$. The first pdf
assumes that the x distribution involves just two parameters, the mean $\mu$ and the
variance $\sigma^2$. The second pdf involves the three parameters of the two-variable
linear model of Chap. 2.

The crucial question in asymptotic theory is how random variables such as $\bar{x}_n$
or $b_n$ and their pdf's behave as $n \to \infty$. For our purposes there are two main
aspects of this behavior, the first relating to convergence in probability and the
second to convergence in distribution.


Convergence in Probability 


A basic result in elementary statistics states that, if the x's have been drawn at 
random from some distribution with mean $\mu$ and variance $\sigma^2$, 

$$E(\bar{x}_n) = \mu \qquad \text{and} \qquad \operatorname{var}(\bar{x}_n) = \frac{\sigma^2}{n}$$

Thus $\bar{x}_n$ is an unbiased estimator of $\mu$ for any sample size, and the variance tends 
to zero as n increases indefinitely. It is then intuitively clear that the distribution 
of $\bar{x}_n$, whatever its precise form, becomes more and more concentrated in the 
neighborhood of $\mu$ as n increases. Formally, if one defines a neighborhood around 
$\mu$ as $\mu \pm \epsilon$, the expression 

$$\Pr(\mu - \epsilon < \bar{x}_n < \mu + \epsilon) = \Pr(|\bar{x}_n - \mu| < \epsilon)$$

indicates the probability that $\bar{x}_n$ lies in that interval. The interval may be made 
arbitrarily small by suitable choice of $\epsilon$. Since $\operatorname{var}(\bar{x}_n)$ declines monotonically with 
increasing n, there exists a number n* and a $\delta$ ($0 < \delta < 1$) such that for all n > n*, 

$$\Pr(|\bar{x}_n - \mu| < \epsilon) > 1 - \delta \tag{7-1}$$


The random variable $\bar{x}_n$ is then said to converge in probability to the constant $\mu$. 
As n increases, the probability of $\bar{x}_n$ lying in a specified interval becomes larger, 
that is, $\delta$ becomes smaller. Thus an equivalent statement is 

$$\lim_{n \to \infty} \Pr(|\bar{x}_n - \mu| < \epsilon) = 1 \tag{7-2}$$

In words, the probability of $\bar{x}_n$ lying in an arbitrarily small interval about $\mu$ can 
be made as close to unity as we desire by letting n become sufficiently large. A 
shorthand way of writing Eq. (7-2) is 

$$\operatorname{plim} \bar{x}_n = \mu \tag{7-3}$$

where plim is an abbreviation of probability limit. The sample mean is then said 
to be a consistent estimator of the population mean $\mu$. By a similar argument the 
reader may easily show that, in the two-variable regression, $b_n$ is a consistent 
estimator of $\beta$, since it was shown in Chap. 2 that 

$$E(b_n) = \beta \qquad \text{and} \qquad \operatorname{var}(b_n) = \frac{\sigma^2}{\sum x^2}$$
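
The statement in Eq. (7-2) can be checked numerically. The following is only a rough Monte Carlo sketch, not part of the original argument: the normal population, the values of mu, sigma, and eps, the number of replications, and the function name prob_within are all assumptions chosen for illustration.

```python
import random

def prob_within(n, eps=0.25, mu=5.0, sigma=2.0, reps=1000, seed=0):
    """Monte Carlo estimate of Pr(|xbar_n - mu| < eps) for a normal population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(xbar - mu) < eps:
            hits += 1
    return hits / reps

# The estimated probability should rise toward 1 as n grows, as in Eq. (7-2).
probs = {n: prob_within(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:5d}   estimated Pr(|xbar - mu| < 0.25) = {p:.3f}")
```

The interval half-width eps is held fixed while n grows; a smaller eps would simply require a larger n before the estimated probability approaches unity.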


These two examples are very simple in that the estimators are unbiased for all 
sample sizes. Suppose we have another estimator, $m_n$, of $\mu$ such that 

$$E(m_n) = \mu + \frac{c}{n}$$

where c is some constant. This estimator is biased in finite samples, but 

$$\lim_{n \to \infty} E(m_n) = \mu$$

and $m_n$ is said to be asymptotically unbiased.† Provided $\operatorname{var}(m_n)$ goes to zero as n 
increases, it may be shown, by use of Chebysheff's theorem, that $m_n$ is a 
consistent estimator of $\mu$. 

Chebysheff's theorem states that for a random variable x with finite mean 
and variance, $\mu$ and $\sigma^2$, and for given $\lambda > 0$, 

$$\Pr\{|x - \mu| > \lambda\sigma\} < \frac{1}{\lambda^2}$$

Applying the theorem to this example gives 

$$\Pr\left\{\left|m_n - \left(\mu + \frac{c}{n}\right)\right| > \lambda\sqrt{\operatorname{var}(m_n)}\right\} < \frac{1}{\lambda^2}$$

Setting $\epsilon = \lambda\sqrt{\operatorname{var}(m_n)}$, this becomes 

$$\Pr\left\{\left|m_n - \left(\mu + \frac{c}{n}\right)\right| > \epsilon\right\} < \frac{\operatorname{var}(m_n)}{\epsilon^2}$$

Taking the limits of each side as n goes to infinity gives 

$$\lim_{n \to \infty} \Pr\{|m_n - \mu| > \epsilon\} = 0 \tag{7-4}$$


Thus $m_n$ is a consistent estimator of $\mu$, since Eq. (7-4) is equivalent to the 
definition of a consistent estimator in Eq. (7-2). So a sufficient condition for 
consistency is that an estimator should be asymptotically unbiased and have a 
variance which converges to zero. 
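
To make the sufficient condition concrete, the sketch below takes one hypothetical estimator of this type, $m_n = \bar{x}_n + c/n$, for which $E(m_n) = \mu + c/n$ and $\operatorname{var}(m_n) = \sigma^2/n$. The estimator itself, the normal population, and all numerical values are assumptions made for illustration. The code compares the simulated frequency of $|m_n - E(m_n)| > \epsilon$ with the Chebysheff bound $\operatorname{var}(m_n)/\epsilon^2$.

```python
import random

def chebyshev_check(n, c=3.0, mu=5.0, sigma=2.0, eps=0.5, reps=1000, seed=1):
    """Return (bias, simulated exceedance frequency, Chebysheff bound) for m_n."""
    rng = random.Random(seed)
    expected = mu + c / n          # E(m_n): biased by c/n in finite samples
    var_m = sigma ** 2 / n         # var(m_n) equals var(xbar_n)
    exceed = 0
    for _ in range(reps):
        m_n = sum(rng.gauss(mu, sigma) for _ in range(n)) / n + c / n
        if abs(m_n - expected) > eps:
            exceed += 1
    return c / n, exceed / reps, var_m / eps ** 2

results = {n: chebyshev_check(n) for n in (20, 80, 320)}
for n, (bias, freq, bound) in results.items():
    print(f"n = {n:4d}  bias = {bias:.4f}  frequency = {freq:.3f}  bound = {bound:.3f}")
```

Both the bias and the bound shrink as n grows, which is the mechanism behind Eq. (7-4). The bound is typically far from tight, since it holds for any distribution with finite variance.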

One of the great advantages of probability limits is their simplicity of 
operation, as illustrated by the following examples: 


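$$\operatorname{plim}(x^2) = (\operatorname{plim} x)^2$$

$$\operatorname{plim}(x^{-1}) = (\operatorname{plim} x)^{-1}$$

$$\operatorname{plim}\left(\frac{x}{y}\right) = \frac{\operatorname{plim}(x)}{\operatorname{plim}(y)}$$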


whether or not x and y are independently distributed. Probability limits may also 
be extended to vectors and matrices. It simply means taking the probability limit 
of each element of the vector or matrix, provided of course that such probability 
limits exist. Operation with these probability limits is again extremely simple. For 
example, 

$$\operatorname{plim}(AB) = \operatorname{plim} A \cdot \operatorname{plim} B$$

$$\operatorname{plim}(A^{-1}) = (\operatorname{plim} A)^{-1}$$

† An alternative definition of asymptotic unbiasedness will be given shortly in the discussion of 
convergence in distribution. 

‡ Consistency, however, does not necessarily imply asymptotic unbiasedness. The standard 
counterexample is given in W. P. Sewell, “Least-Squares, Conditional Predictions and Estimator 
Properties,” Econometrica, 37, 1969, pp. 39-43. 

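As an illustration, recall the OLS coefficient vector from the basic model in Chap. 
5, namely, 

$$b = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'u = \beta + \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'u\right)$$

The matrix $\frac{1}{n}X'X$ consists of the mean squares and mean cross products of the 
explanatory variables. If the X matrix is constant in repeated samples, then† 

$$\lim_{n \to \infty}\left(\frac{1}{n}X'X\right) = \left(\frac{1}{n}X'X\right)$$

If the explanatory variables are stochastic, it can be shown that the sample 
moments will converge in probability to the population moments. Thus we write 

$$\operatorname{plim}\left(\frac{1}{n}X'X\right) = \Sigma \tag{7-5}$$

where $\Sigma$ is a given symmetric, finite, positive definite matrix. It remains to 
evaluate 

$$\operatorname{plim}\left(\frac{1}{n}X'u\right) = \begin{bmatrix} \operatorname{plim}\left(\frac{1}{n}\sum u_t\right) \\ \operatorname{plim}\left(\frac{1}{n}\sum X_{2t}u_t\right) \\ \vdots \\ \operatorname{plim}\left(\frac{1}{n}\sum X_{kt}u_t\right) \end{bmatrix}$$

The element inside the first parentheses is $\bar{u}$. Since $E(\bar{u}) = 0$ and $\operatorname{var}(\bar{u}) = \sigma^2/n$, 
it follows that $\operatorname{plim}(\bar{u}) = 0$. For the ith element 

$$E\left(\frac{1}{n}\sum X_{it}u_t\right) = 0$$

which holds both for the case where X is fixed and also for stochastic X on the 
assumption of zero covariance with u. Also 

$$\operatorname{var}\left(\frac{1}{n}\sum X_{it}u_t\right) = \frac{\sigma^2}{n} \cdot \frac{\sum X_{it}^2}{n}$$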


† For the proof see H. Theil, Principles of Econometrics, Wiley, New York, 1971, pp. 364-365. 

In view of Eq. (7-5), the probability limit of $\sum X_{it}^2/n$ is a constant. Thus the 
probability limit of the variance is zero. Repeating the argument for the other 
terms, 

$$\operatorname{plim}\left(\frac{1}{n}X'u\right) = 0$$

and so 

$$\operatorname{plim} b = \beta + \left[\operatorname{plim}\left(\frac{1}{n}X'X\right)\right]^{-1} \operatorname{plim}\left(\frac{1}{n}X'u\right) = \beta + \Sigma^{-1} \cdot 0 = \beta$$

which proves the consistency of the OLS estimator. 
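
A small simulation can illustrate this consistency result for the two-variable model. It is a sketch under assumed conditions only: the uniform regressor, the uniform (nonnormal) disturbance drawn independently of it, the parameter values, and the function name ols_slope are all illustrative choices rather than anything specified in the text.

```python
import random

def ols_slope(n, alpha=1.0, beta=2.0, seed=2):
    """OLS slope from one sample of size n with a stochastic regressor."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    u = [rng.uniform(-3.0, 3.0) for _ in range(n)]   # mean zero, independent of x
    y = [alpha + beta * xi + ui for xi, ui in zip(x, u)]
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

# One sample per sample size; the estimate should settle near beta = 2.
slopes = {n: ols_slope(n) for n in (20, 200, 2000, 20000)}
for n, b in slopes.items():
    print(f"n = {n:6d}   b = {b:.4f}   |b - beta| = {abs(b - 2.0):.4f}")
```

Because each line is a single random sample, the error need not fall monotonically from one n to the next; only its tendency toward zero is implied by the probability limit.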


Convergence in Distribution 


Return again to the sample mean $\bar{x}_n$. If the population from which the x's are 
drawn at random may be characterized by 

$$x \sim N(\mu, \sigma^2)$$

then $\bar{x}_n$, being a linear combination of normal variables, has a normal pdf. Thus 

$$\bar{x}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

and $f(\bar{x}_n)$ is normal for every n. The limiting distribution is found by examining 
what happens to $f(\bar{x}_n)$ as n goes to infinity. Since $\operatorname{var}(\bar{x}_n)$ goes to zero, the whole 
mass is concentrated on the point $\mu$ in the limit and the distribution is said to be 
degenerate. A simple transformation of $\bar{x}_n$, however, can lead to a limiting 
distribution which does not collapse on a single point. Consider 

$$z_n = \sqrt{n}(\bar{x}_n - \mu)$$

Clearly, $E(z_n) = 0$ and $\operatorname{var}(z_n) = \sigma^2$. Thus 

$$f(z_n) \text{ is } N(0, \sigma^2) \text{ for any } n$$

that is, the limiting distribution and all finite sample distributions are identical 
since the parameters of the distribution do not involve n. 

The real application of these ideas comes in situations where finite sample 
pdf's either cannot be derived at all or are very difficult to derive and manipulate, 
but a tractable limiting distribution can be obtained. The limiting distribution may 
then be taken as an approximation for the unknown or intractable finite sample 
distribution. As an illustration, suppose the random variable x has mean $\mu$ and 
variance $\sigma^2$, as before, but the distribution of x is no longer normal. A 
fundamental result, the central limit theorem, states that† 

the limiting distribution of $z_n = \sqrt{n}(\bar{x}_n - \mu)$ is $N(0, \sigma^2)$. 

Thus irrespective of the form of f(x), the limiting distribution of $z_n$ is still normal, 
though the quality of the approximation to any finite pdf will be influenced by the 
extent to which f(x) departs from normality. Alternative ways of expressing this 
result are 

$\sqrt{n}(\bar{x}_n - \mu)$ converges in distribution to $N(0, \sigma^2)$ 

or 

$$\sqrt{n}(\bar{x}_n - \mu) \xrightarrow{D} N(0, \sigma^2) \tag{7-6}$$

This result is often expressed loosely as “$\bar{x}_n$ is asymptotically normally distributed 
with mean $\mu$ and variance $\sigma^2/n$” and $\sigma^2/n$ is then referred to as the asymptotic 
variance of $\bar{x}_n$. A shorthand version of this statement is 

$$\bar{x}_n \sim AN\left(\mu, \frac{\sigma^2}{n}\right) \tag{7-7}$$

with 

$$\operatorname{asy\,var}(\bar{x}_n) = \frac{\sigma^2}{n}$$

where AN indicates asymptotically normal and asy var, asymptotic variance. As 
already emphasized, the limiting (asymptotic) distribution of $\bar{x}_n$ is degenerate. 
The practical import of Eq. (7-7) is that in cases where $f(\bar{x}_n)$ is intractable we are 
taking it, for sufficiently large n, to be approximately normal with mean $\mu$ and 
variance $\sigma^2/n$. These procedures extend directly to multivariate situations, and 
we will give an example in Sec. 7-4 with a treatment of the k-variable linear model 
when the disturbances are nonnormal. 
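
The central limit theorem in Eq. (7-6) can be illustrated with a markedly nonnormal population. The sketch below is an assumption-laden example rather than anything from the text: it draws x from an exponential distribution with mean 1 and variance 1, fixes n = 100, and summarizes simulated values of $z_n$.

```python
import math
import random
import statistics

def simulate_z(n, reps=4000, seed=3):
    """Simulated values of z_n = sqrt(n) * (xbar_n - mu) for Exp(1) data."""
    rng = random.Random(seed)
    mu = 1.0   # mean of the Exp(1) population; its variance is also 1
    zs = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append(math.sqrt(n) * (xbar - mu))
    return zs

zs = simulate_z(n=100)
z_mean = statistics.fmean(zs)
z_var = statistics.pvariance(zs)
# Under N(0, 1) about 95 percent of the mass lies within 1.96 of zero.
share_within = sum(abs(z) < 1.96 for z in zs) / len(zs)
print(f"mean = {z_mean:.3f}  variance = {z_var:.3f}  share within 1.96 = {share_within:.3f}")
```

With a skewed parent distribution the normal approximation is only approximate at any finite n, which is the point made above about the quality of the approximation depending on how far f(x) departs from normality.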

† For a proof see H. Theil, op. cit., pp. 367-369. 

If we consider the class of consistent and asymptotically normal estimators, 
the one with the minimum asymptotic variance is said to be asymptotically 
efficient. The mean of the asymptotic distribution provides an alternative measure 
of the asymptotic expectation of an estimator, and hence of asymptotic bias. 
Previously we implicitly defined the asymptotic expectation (AE) of an estimator 
$\hat{\theta}$ as 

$$AE(\hat{\theta}) = \lim_{n \to \infty} E(\hat{\theta})$$

The alternative definition is 

$$AE(\hat{\theta}) = \text{mean of the asymptotic distribution of } \hat{\theta}$$

Similar definitions apply to second- and higher-order moments. In many cases the 
two definitions are equivalent, but there are instances where it is important to 
distinguish between limits of sequences of moments and the corresponding 
moments of a limiting distribution. Sometimes the moments of a finite sampling 
distribution may not exist or cannot be established, although a limiting 
distribution with well-defined moments does exist.† 


7-3 MAXIMUM LIKELIHOOD ESTIMATORS 


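We will illustrate the principle of maximum likelihood (ML) estimation in the 
context of the linear regression model.‡ Let us retain the assumption of a fixed 
nonstochastic X matrix. The model 

$$y = X\beta + u \tag{7-8}$$

then defines a transformation from u to y. The assumption of a multivariate 
density function for u implies a multivariate density function for y, which may be 
written 

$$p(y) = p(u)\left|\frac{\partial u}{\partial y}\right|$$

where $|\partial u/\partial y|$ indicates the absolute value of the determinant formed from the 
matrix of partial derivatives§ 

$$\begin{bmatrix} \dfrac{\partial u_1}{\partial y_1} & \dfrac{\partial u_1}{\partial y_2} & \cdots & \dfrac{\partial u_1}{\partial y_n} \\ \dfrac{\partial u_2}{\partial y_1} & \dfrac{\partial u_2}{\partial y_2} & \cdots & \dfrac{\partial u_2}{\partial y_n} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial u_n}{\partial y_1} & \dfrac{\partial u_n}{\partial y_2} & \cdots & \dfrac{\partial u_n}{\partial y_n} \end{bmatrix}$$

In the case of Eq. (7-8) this matrix is seen to be the identity matrix whose 
determinant is unity. Thus 

$$p(y) = p(u)$$

If we further assume, as before, that u is multivariate normal with mean vector 0 
and variance matrix $\sigma^2 I$, then formula (5-5) gives 

$$p(u) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}u'u\right]$$

and so 

$$p(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right] \tag{7-9}$$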


† For details see H. Theil, op. cit., pp. 375-378. 

‡ For a general account of estimators the reader might consult P. G. Hoel, Introduction to 
Mathematical Statistics, 4th ed., Wiley, New York, 1971, pp. 196-200; L. D. Taylor, Probability and 
Mathematical Statistics, Harper and Row, New York, 1974, pp. 197-230; or M. G. Kendall and A. 
Stuart, The Advanced Theory of Statistics, vol. 2, Griffin, London, 1961, pp. 1-67. 

§ See App. A-9, Change of Variables in Density Functions. 

Equation (7-9) involves both the observations on y and the unknown parameters 
$\beta$ and $\sigma^2$. Writing p(y) in the form $L(y; \beta, \sigma^2)$ emphasizes that it is the 
probability density for the y's, given the parameters $\beta$ and $\sigma^2$. Alternatively, 
writing it as $L(\beta, \sigma^2; y)$ stresses that for given y it can be regarded as a function 
of the parameters. It is termed the likelihood function and is conventionally 
denoted by the symbol L. 

The ML principle is to choose as estimators of $\beta$ and $\sigma^2$ the values which 
maximize the likelihood function, given the sample data y. Letting $\theta' = [\beta' \;\; \sigma^2]$ 
denote the vector of unknown parameters and $\hat{\theta}$ the ML estimator, $\hat{\theta}$ is obtained 
as the solution of the equation 

$$\frac{\partial L}{\partial \theta} = 0 \tag{7-10}$$

In practice the derivation of the ML estimators is often simplified by maximizing 
the log of the likelihood function, that is, by finding $\hat{\theta}$ as the solution to 

$$\frac{\partial \ln L}{\partial \theta} = 0 \tag{7-11}$$

Since 

$$\frac{\partial \ln L}{\partial \theta} = \frac{1}{L}\frac{\partial L}{\partial \theta}$$

the same vector $\hat{\theta}$ is obtained as the solution to Eqs. (7-10) and (7-11) for any 
L > 0. 

Taking the natural logarithm of the likelihood in Eq. (7-9) gives 

$$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$$

Differentiating partially with respect to $\beta$ and $\sigma^2$ and evaluating these derivatives 
at the ML estimators gives the specific form of Eq. (7-11) as 

$$\frac{\partial(\ln L)}{\partial \beta} = -\frac{1}{2\hat{\sigma}^2}(-2X'y + 2X'X\hat{\beta}) = \frac{1}{\hat{\sigma}^2}(X'y - X'X\hat{\beta}) = 0$$

$$\frac{\partial(\ln L)}{\partial \sigma^2} = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}(y - X\hat{\beta})'(y - X\hat{\beta}) = 0 \tag{7-12}$$

The simultaneous solution of these k + 1 equations gives 

$$\hat{\beta} = (X'X)^{-1}X'y \tag{7-13}$$

and 

$$\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n} = \frac{e'e}{n} \tag{7-14}$$

The ML $\hat{\beta}$ is seen, in this case, to be identical with the OLS b. The estimate of $\sigma^2$, 
however, differs from the unbiased $s^2$ of Eq. (5-57) by the factor (n − k)/n, 
which illustrates the fact that ML estimators are not necessarily unbiased. In this 
application $\hat{\beta}$ is an unbiased estimator of $\beta$, but $\hat{\sigma}^2$ is a biased estimator of $\sigma^2$. 
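
The relation between Eqs. (7-13) and (7-14) and the OLS quantities can be checked on artificial data. The following sketch assumes a two-variable model (k = 2) with arbitrary parameter values and normal disturbances; the helper names fit_two_variable and log_likelihood are hypothetical. It confirms that $\hat{\sigma}^2 = [(n - k)/n]s^2$ and that moving away from the ML values lowers the log likelihood.

```python
import math
import random

def fit_two_variable(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sum of squares)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, rss

def log_likelihood(a, b, sigma2, x, y):
    """ln L for the normal linear model, i.e. the log of Eq. (7-9) with k = 2."""
    n = len(x)
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(sigma2) - rss / (2 * sigma2)

rng = random.Random(4)
n, k = 25, 2
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.5) for xi in x]

a_hat, b_hat, rss = fit_two_variable(x, y)
sigma2_ml = rss / n          # ML estimate, Eq. (7-14)
s2 = rss / (n - k)           # unbiased OLS estimate

ll_max = log_likelihood(a_hat, b_hat, sigma2_ml, x, y)
ll_at_s2 = log_likelihood(a_hat, b_hat, s2, x, y)
ll_shifted = log_likelihood(a_hat, b_hat + 0.05, sigma2_ml, x, y)
print(sigma2_ml / s2, (n - k) / n)
print(ll_max, ll_at_s2, ll_shifted)
```

The drop in ln L when $s^2$ replaces $\hat{\sigma}^2$ is small, which reflects how flat the likelihood is in $\sigma^2$ near its maximum for a sample of this size.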


Properties of ML Estimators 


ML estimators have a number of desirable properties, some of which hold for 
finite samples and some of which only hold asymptotically. Of the finite (small 
sample) results one of the most important is the following: 


If a minimum variance bound (MVB) estimator exists, it is given by the ML 
method. 


The minimum variance bound (MVB), developed in the remarkable Cramer-Rao 
theorem, establishes a minimum for the variance of an unbiased estimator. It 
is important to note that the theorem relates to the class of unbiased estimators 
and not just to the subset of linear unbiased estimators. Furthermore the theorem 
establishes a lower bound for the variance, but there may, of course, be situations 
where the bound cannot be attained, that is, where one can derive a minimum 
variance unbiased (MVU) estimator, but its variance will exceed the MVB. The 
bound is derived from the likelihood function. 

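Consider first of all the case of a single unknown parameter $\theta$, a density 
function $f(y|\theta)$, and a random sample of n observations from this density. The 
likelihood function is then 

$$L(\theta|y) = \prod_{i=1}^{n} f(y_i|\theta)$$

Let $\hat{\theta}$ denote an unbiased estimator of $\theta$. Then the Cramer-Rao theorem states 

$$\operatorname{var}(\hat{\theta}) \ge \frac{1}{E\left[\left(\dfrac{\partial \ln L}{\partial \theta}\right)^2\right]} = \frac{1}{-E\left[\dfrac{\partial^2 \ln L}{\partial \theta^2}\right]} \tag{7-15}$$

where either of the expressions on the right-hand side indicates the MVB. For the 
multiparameter case, let $\hat{\theta}$ denote an unbiased estimator of the vector $\theta$ of, say, k 
unknown parameters. Now we have a variance-covariance matrix for the elements 
of $\hat{\theta}$, denoted by $\operatorname{var}(\hat{\theta})$. The multidimensional equivalent of 

$$-E\left[\frac{\partial^2 \ln L}{\partial \theta^2}\right]$$

in Eq. (7-15) is now the symmetric matrix 

$$R(\theta) = -E\left[\frac{\partial^2 \ln L}{\partial \theta \, \partial \theta'}\right] = -E\begin{bmatrix} \dfrac{\partial^2 \ln L}{\partial \theta_1^2} & \dfrac{\partial^2 \ln L}{\partial \theta_1 \partial \theta_2} & \cdots & \dfrac{\partial^2 \ln L}{\partial \theta_1 \partial \theta_k} \\ \dfrac{\partial^2 \ln L}{\partial \theta_2 \partial \theta_1} & \dfrac{\partial^2 \ln L}{\partial \theta_2^2} & \cdots & \dfrac{\partial^2 \ln L}{\partial \theta_2 \partial \theta_k} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial^2 \ln L}{\partial \theta_k \partial \theta_1} & \dfrac{\partial^2 \ln L}{\partial \theta_k \partial \theta_2} & \cdots & \dfrac{\partial^2 \ln L}{\partial \theta_k^2} \end{bmatrix} \tag{7-16}$$

$R(\theta)$ is often referred to as the information matrix. The multidimensional version 
of the Cramer-Rao theorem now states 

$\operatorname{var}(\hat{\theta}) - R^{-1}(\theta)$ is a positive semidefinite matrix 

Thus the MVB, for any $\hat{\theta}_i$, is given by the ith element on the principal diagonal of 
$R^{-1}(\theta)$.† 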

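As an illustration of this result, let us return to the k-variable linear model. 
The first-order derivatives of the likelihood function were given in Eq. (7-12). 
Differentiating these again we obtain‡ 

$$\frac{\partial^2(\ln L)}{\partial \beta \, \partial \beta'} = -\frac{X'X}{\sigma^2}$$

$$\frac{\partial^2(\ln L)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{(y - X\beta)'(y - X\beta)}{\sigma^6}$$

$$\frac{\partial^2(\ln L)}{\partial \beta \, \partial \sigma^2} = -\frac{1}{\sigma^4}(X'y - X'X\beta)$$

Taking expectations of these second-order derivatives and reversing signs gives 

$$-E\left[\frac{\partial^2(\ln L)}{\partial \beta \, \partial \beta'}\right] = \frac{X'X}{\sigma^2}$$

$$-E\left[\frac{\partial^2(\ln L)}{\partial(\sigma^2)^2}\right] = -\frac{n}{2\sigma^4} + \frac{n\sigma^2}{\sigma^6} = \frac{n}{2\sigma^4}$$

since 

$$E(y - X\beta)'(y - X\beta) = E(u'u) = n\sigma^2$$

and 

$$-E\left[\frac{\partial^2(\ln L)}{\partial \beta \, \partial \sigma^2}\right] = 0$$

since 

$$E(X'y - X'X\beta) = E\{X'(y - X\beta)\} = E(X'u) = 0$$

Substituting in Eq. (7-16) and inverting the resulting matrix then gives 

$$R^{-1}\begin{bmatrix} \beta \\ \sigma^2 \end{bmatrix} = \begin{bmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{bmatrix} \tag{7-17}$$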


† Derivations of the Cramer-Rao MVB may be found in P. G. Hoel, op. cit., pp. 362-365; L. D. 
Taylor, op. cit., pp. 209-213; and M. G. Kendall and A. Stuart, op. cit., Chap. 17. 

‡ Note that Eq. (7-12) contains $\hat{\beta}$ and $\hat{\sigma}^2$ because the first-order derivatives had been equated to 
zero. We now ignore the equalities, replace $\hat{\beta}$ by $\beta$, $\hat{\sigma}^2$ by $\sigma^2$, and differentiate again. 

We see immediately that the ML (OLS) estimator of $\beta$ attains the Cramer-Rao 
MVB, since $\operatorname{var}(\hat{\beta}) = \operatorname{var}(b) = \sigma^2(X'X)^{-1}$, which is identical with the top left-hand 
submatrix in $R^{-1}$. The same result does not hold for either the OLS or the ML 
estimator of $\sigma^2$. The OLS estimator is 

$$s^2 = \frac{e'e}{n - k}$$

and it was established in Eq. (5-59) that 

$$\frac{e'e}{\sigma^2} \sim \chi^2(n - k)$$

Thus 

$$s^2 \sim \frac{\sigma^2}{n - k}\,\chi^2(n - k)$$

Recalling that the variance of a $\chi^2$ variable is equal to twice its number of degrees 
of freedom, 

$$\operatorname{var}(s^2) = \frac{\sigma^4}{(n - k)^2} \cdot 2(n - k) = \frac{2\sigma^4}{n - k}$$

which, for any finite n, is somewhat greater than the variance term given in $R^{-1}$. 
There is, in fact, no unbiased estimator of $\sigma^2$ which can attain the MVB. The 
derivation of the variance of the ML estimator, $\hat{\sigma}^2 = e'e/n$, is left as an exercise 
for the reader, but in any case it is a biased estimator of $\sigma^2$. 
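
The gap between $\operatorname{var}(s^2) = 2\sigma^4/(n - k)$ and the bound $2\sigma^4/n$ is easy to see in a simulation with a small sample. The sketch below is illustrative only: it assumes a two-variable model with a fixed regressor, n = 10, k = 2, $\sigma^2 = 1$, normal disturbances, and arbitrary coefficient values.

```python
import random
import statistics

n, k, sigma2 = 10, 2, 1.0
x = [float(t) for t in range(1, n + 1)]        # fixed (nonstochastic) regressor
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

def s2_one_sample(rng):
    """Draw one sample of y and return s^2 = e'e / (n - k)."""
    y = [2.0 + 0.5 * xi + rng.gauss(0.0, sigma2 ** 0.5) for xi in x]
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (n - k)

rng = random.Random(5)
draws = [s2_one_sample(rng) for _ in range(20000)]

mean_s2 = statistics.fmean(draws)               # should be close to sigma^2
var_s2_sim = statistics.pvariance(draws)        # simulated var(s^2)
var_s2_theory = 2 * sigma2 ** 2 / (n - k)       # 2 sigma^4 / (n - k)
mvb = 2 * sigma2 ** 2 / n                       # bound from R^{-1} in Eq. (7-17)
print(mean_s2, var_s2_sim, var_s2_theory, mvb)
```

With n = 10 and k = 2 the theoretical variance is 0.25 against a bound of 0.20; the two converge as n grows relative to k.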

A second important feature of ML estimators is their invariance property, 
which holds for any sample size and may be stated as follows: 


The ML estimate of a function $g(\theta)$ is $g(\hat{\theta})$, where $\hat{\theta}$ is the ML estimate of $\theta$. 

We have seen that the ML estimate of $\sigma^2$ in the regression model is 
$\hat{\sigma}^2 = e'e/n$. Thus the ML estimate of $\sigma$ is simply $\sqrt{e'e/n}$. 

The most important result about ML estimators relates to their large sample 
or asymptotic properties. 


Under certain regularity conditions, ML estimators are consistent, asymptotically 
normally distributed, and asymptotically efficient.† 


Specifically, if $\hat{\theta}$ denotes the ML estimate of $\theta$, 

$$\operatorname{plim} \hat{\theta} = \theta \tag{7-18}$$

$$\hat{\theta} \sim AN(\theta, R^{-1}) \tag{7-19}$$

where R has already been defined in Eq. (7-16) as 

$$R = -E\left(\frac{\partial^2 \ln L}{\partial \theta \, \partial \theta'}\right)$$

Thus the ML estimators, besides being consistent and asymptotically normal, are 
efficient in that the asymptotic variance matrix reaches the Cramer-Rao lower 
bound. 

† Reference may be made, for example, to Kendall and Stuart, op. cit., for a comprehensive 
statement of the underpinning assumptions and the derivation of this and other results. 


7-4 SOME ASYMPTOTIC RESULTS FOR THE k-VARIABLE 
LINEAR MODEL 


In this section we will relax two of the assumptions underpinning the k-variable 
linear model, namely, the normality of the disturbance term and the nonstochastic 
nature of the X matrix. 


Nonnormal Disturbances 


Let us retain assumptions 1 and 2 of Sec. 7-1, that is, 


1. X is nonstochastic of full column rank k. 
2. $E(u) = 0$ and $\operatorname{var}(u) = \sigma^2 I$. 


but dispense with the assumption of a normal distribution for the u's. Under 
assumptions 1 and 2 the OLS b is still a best linear unbiased estimator of $\beta$ with 
variance matrix $\sigma^2(X'X)^{-1}$. Moreover, as already shown, b is a consistent 
estimator of $\beta$. Thus even when the disturbances are nonnormal, OLS is still a very 
acceptable technique for deriving point estimates. The difficulty is that the various 
exact inference procedures outlined in Chaps. 5 and 6 are no longer strictly valid 
since their derivation depended on the assumption of normality. However, one 
may conjecture that the procedures are reasonably robust for moderate 
departures from normality.† More importantly the tests can be given a large sample 
justification. This requires the use of two theoretical results. 

First, if X is nonstochastic of full column rank k, $E(u) = 0$, $\operatorname{var}(u) = \sigma^2 I$, the 
elements of X are uniformly bounded, and $\lim_{n \to \infty}(1/n)X'X = \Sigma$, a finite, 
symmetric, positive definite matrix, then‡ 

$$\frac{1}{\sqrt{n}}(X'u) \sim AN(0, \sigma^2\Sigma) \tag{7-20}$$

† “... it has been shown that these tests are not very sensitive to departure from normality. If the 
errors are not normally distributed but have a variance, it is generally true that only trivial errors are 
made in the powers or the levels of significance if we retain the formulae which are strictly applicable 
in the case where the errors are normal.” E. Malinvaud, Statistical Methods of Econometrics, 2nd ed., 
North-Holland, Amsterdam, 1970, p. 99. See also Malinvaud's discussion on pp. 296-302. Additional 
references on this topic are P. Schmidt, Econometrics, Marcel Dekker, 1976, pp. 55-64; A. C. Harvey, 
The Econometric Analysis of Time Series, Wiley, New York, 1981, pp. 112-117; and G. G. Judge, 
W. E. Griffiths, R. C. Hill, and T. C. Lee, The Theory and Practice of Econometrics, Wiley, New York, 
1980, Chap. 7. 

‡ For a proof see P. Schmidt, op. cit., pp. 56-60. 

Second, let $(X_n, Y_n)$ denote a sequence of pairs of random variables, where $X_n$ 
has a probability limit and $Y_n$ a limiting distribution, that is, 

$$\operatorname{plim} X_n = c$$

and 

$$Y_n \xrightarrow{D} f(Y)$$

then 

$$(X_n Y_n) \xrightarrow{D} f(cY)$$

or, in words, the product $X_n Y_n$ has the limiting distribution $f(cY)$.† If, in 
particular, $Y_n$ has a normal limiting distribution 

$$Y_n \xrightarrow{D} N(\mu, \sigma^2)$$

then 

$$X_n Y_n \xrightarrow{D} N(c\mu, c^2\sigma^2)$$

The multidimensional version of the same result is as follows: Let $y_n$ denote a 
k × 1 vector with limiting distribution 

$$y_n \xrightarrow{D} N(\mu, \Omega)$$

and $z_n$ an r × 1 vector defined by 

$$z_n = H_n y_n \tag{7-21}$$

where $H_n$ is an r × k matrix with probability limit 

$$\operatorname{plim} H_n = H$$

then 

$$z_n \xrightarrow{D} N(H\mu, H\Omega H') \tag{7-22}$$

Returning now to the main argument 

$$b - \beta = \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'u\right)$$

Thus 

$$\sqrt{n}(b - \beta) = \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{\sqrt{n}}X'u\right)$$

This is seen to be in the form of Eq. (7-21), and a direct application of Eq. 
(7-22) gives the result that $\sqrt{n}(b - \beta)$ has a limiting normal distribution with zero 
mean vector and variance matrix given by 

$$\Sigma^{-1}(\sigma^2\Sigma)\Sigma^{-1} = \sigma^2\Sigma^{-1}$$

Thus we may write 

$$\sqrt{n}(b - \beta) \xrightarrow{D} N(0, \sigma^2\Sigma^{-1}) \tag{7-23}$$

$$b \sim AN\left(\beta, \frac{1}{n}\sigma^2\Sigma^{-1}\right) \tag{7-24}$$

† See C. R. Rao, Linear Statistical Inference and Its Applications, Wiley, New York, 1965, pp. 101 
ff., for this and other important limit theorems. 


In a practical application $\Sigma$ is unknown and is replaced by the sample estimate 
$((1/n)X'X)$ and $\sigma^2$ is estimated by $s^2 = e'e/(n - k)$, where $e = y - Xb$. Thus the 
estimated asymptotic variance-covariance matrix for b is 

$$\operatorname{asy\,var}(b) = s^2(X'X)^{-1}$$

which is the finite sample estimator of Chaps. 5 and 6. It can be shown that Eq. 
(7-24) essentially ensures that all the conventional t and F tests are valid 
asymptotically.† It is a moot point whether in the test of a single restriction 
critical values should be taken from N(0, 1) rather than t(n − k), and in the test 
of a set of q restrictions, whether one should consult the $\chi^2(q)/q$ distribution or 
F(q, n − k). However, it is not a matter of great importance since 

$$t(n - k) \xrightarrow{D} N(0, 1) \qquad \text{and} \qquad F(q, n - k) \xrightarrow{D} \frac{\chi^2(q)}{q}$$

and, moreover, under the assumptions made in this section, the procedures are 
only valid asymptotically so that one should not place undue emphasis on precise 
levels of significance. 
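
The large sample justification can be illustrated by simulating the conventional t ratio for the slope when the disturbances are clearly nonnormal. This is a sketch under assumptions that are not in the text: a two-variable model, a fixed bounded regressor, centered exponential disturbances (mean 0, variance 1), n = 200, and the N(0, 1) critical value 1.96 for a nominal 5 percent test of a true hypothesis.

```python
import math
import random

N = 200
X = [1.0 + (t % 10) for t in range(N)]          # fixed, uniformly bounded regressor
XBAR = sum(X) / N
SXX = sum((x - XBAR) ** 2 for x in X)

def t_ratio(beta, rng):
    """Conventional t ratio for the slope, evaluated at the true beta."""
    u = [rng.expovariate(1.0) - 1.0 for _ in range(N)]   # skewed, mean zero
    y = [1.0 + beta * x + ui for x, ui in zip(X, u)]
    ybar = sum(y) / N
    b = sum((x - XBAR) * (yi - ybar) for x, yi in zip(X, y)) / SXX
    a = ybar - b * XBAR
    s2 = sum((yi - a - b * x) ** 2 for x, yi in zip(X, y)) / (N - 2)
    return (b - beta) / math.sqrt(s2 / SXX)

rng = random.Random(6)
reps = 2000
rejections = sum(abs(t_ratio(0.5, rng)) > 1.96 for _ in range(reps))
rejection_rate = rejections / reps
print(f"empirical size of the nominal 5 percent test: {rejection_rate:.3f}")
```

The empirical rejection rate should be close to, but not exactly, 0.05, which is consistent with the advice above not to place undue emphasis on precise levels of significance.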


† For details see P. Schmidt, op. cit., pp. 60-64. 


Stochastic X Matrix 


The explanatory variables in an econometric relation are not usually subject to 
control by the economic researcher, the secretary of the U.S. Treasury, or anyone 
else. Rather they are mostly the outcome of the functioning of some 
economic/social system. Let us, therefore, characterize the $X_{it}$ (i = 1,..., k; 
t = 1,..., n) as possessing some multivariate density function g(X). We will make 
two crucial assumptions about this density function, namely: 


1. The parameters of g(X) do not involve the parameters $\beta$ and $\sigma^2$ of the 
regression model. 

2. X and u are independently distributed. This is the strong assumption of full 
independence. It implies, for example, that the disturbance $u_t$ is independent 
of $X_{is}$ for all i = 1,..., k and all s = 1,..., n or, in other words, that $u_t$ is 
independent of all past values, the current values, and all future values of all 
explanatory variables. This strong assumption would be violated if a lagged 
value of Y, such as $Y_{t-1}$, appeared among the explanatory variables, for $u_{t-1}$ 
influences $Y_{t-1}$, so $Y_{t-1}$ is not independent of $u_{t-1}$. Furthermore $u_{t-2}$ 
influences $Y_{t-2}$, which in turn influences $Y_{t-1}$. Thus $Y_{t-1}$ is dependent on 
$u_{t-1}, u_{t-2}, \ldots$, but is independent of $u_t$, $u_{t+1}$, and all later u's.† 


Formally we may express assumption 2 as 

$$p(u|X) = p(u)$$

$$E(u|X) = E(u)$$

$$E(y|X) = E(X\beta + u|X) = X\beta + E(u)$$

$$E(uu'|X) = E(uu')$$

The likelihood of the sample observations on both Y and X's may be written 

$$p(y, X) = p(y|X) \cdot g(X) = p(u|X) \cdot g(X) = p(u) \cdot g(X)$$

We also retain the assumption: 


3. $u \sim N(0, \sigma^2 I)$ 

Thus the log likelihood becomes 

$$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) + \ln g(X)$$

This differs from the likelihood for the fixed X model only in the additional term 
in ln g(X). Since this term does not involve $\beta$ and $\sigma^2$, the ML estimators of these 
parameters are the same as those already given for the fixed X model in Eqs. 
(7-13) and (7-14). Thus the ML estimator of $\beta$, which still equals the OLS 
estimator, will at least have desirable asymptotic properties when the X's are 
stochastic. Furthermore, reworking the development leading up to Eq. (7-17) now 
gives 

$$R^{-1}\begin{bmatrix} \beta \\ \sigma^2 \end{bmatrix} = \begin{bmatrix} \sigma^2 E(X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{bmatrix} \tag{7-25}$$

so that the asymptotic variance matrix for $\hat{\beta}$ (= b) is $\sigma^2 E(X'X)^{-1}$, which is the 
MVB. 


† This case is treated in Sec. 9-2. 

Turning to the small sample properties it is easy to show first of all that b is 
an unconditionally unbiased estimator. From Chap. 5, 

$$b = \beta + (X'X)^{-1}X'u$$

Thus 

$$E(b) = \beta + E\{(X'X)^{-1}X'u\}$$

$$= \beta + E\{(X'X)^{-1}X'\} \cdot E(u) \qquad \text{since X and u are independent}$$

$$= \beta \qquad \text{since } E(u) = 0$$

For the variance-covariance matrix 

$$\operatorname{var}(b) = E\{(b - \beta)(b - \beta)'\}$$

$$= E\{(X'X)^{-1}X'uu'X(X'X)^{-1}\}$$

$$= E_X\{E_{u|X}[(X'X)^{-1}X'uu'X(X'X)^{-1}]\}$$

where $E_{u|X}$ indicates the expected value in the conditional distribution of u given 
X, and $E_X$ indicates the expected value in the marginal distribution of X.† Thus 

$$\operatorname{var}(b) = E_X\{\sigma^2(X'X)^{-1}\} = \sigma^2 E(X'X)^{-1} \tag{7-26}$$

This shows that the finite sample variance attains the Cramer-Rao lower bound 
and differs from the corresponding formula in the nonstochastic (fixed X) case only 
in that $(X'X)^{-1}$ is replaced by $E(X'X)^{-1}$. 
Formula (7-26) may be established in an alternative and instructive fashion. 
It was shown in Eq. (5-33) that the variance matrix for b, given some X matrix, 
was $\sigma^2(X'X)^{-1}$. Emphasizing the conditional nature of this variance we can write 

$$\operatorname{var}(b|X) = \sigma^2(X'X)^{-1}$$

or letting S = X'X, 

$$\operatorname{var}(b|S) = \sigma^2 S^{-1}$$

Now suppose that the random X's can, in principle, generate a finite number of S 
matrices $S_1, S_2, \ldots, S_m$ with probabilities $p_1, p_2, \ldots, p_m$. Then the unconditional 
variance matrix for b, determined from the marginal distribution, is 

$$\operatorname{var}(b) = \sum_{i=1}^{m} \operatorname{var}(b|S_i) \cdot p_i = \sigma^2 \sum_{i=1}^{m} S_i^{-1} p_i = \sigma^2 \cdot E(S^{-1}) = \sigma^2 E(X'X)^{-1} \qquad \text{as in Eq. (7-26)}$$
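
The discrete argument can be mimicked in a deliberately simplified setting. The sketch below assumes a regression through the origin with a single regressor, so that S = x'x is a scalar, and two hypothetical designs occurring with probabilities 0.3 and 0.7; none of these choices comes from the text. It compares the simulated unconditional variance of b with $\sigma^2 E(S^{-1})$ and, for contrast, with $\sigma^2/E(S)$.

```python
import random
import statistics

sigma, beta = 1.0, 2.0
designs = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.0, 1.0]]   # two possible x vectors
probs = [0.3, 0.7]
S = [sum(xi * xi for xi in d) for d in designs]           # S_i = x'x for each design

# sigma^2 * E(S^{-1}), the scalar analogue of the sum over S_i^{-1} p_i
var_formula = sigma ** 2 * sum(p / s for p, s in zip(probs, S))
# sigma^2 / E(S), which is a different quantity
var_naive = sigma ** 2 / sum(p * s for p, s in zip(probs, S))

rng = random.Random(7)
estimates = []
for _ in range(40000):
    d = designs[0] if rng.random() < probs[0] else designs[1]
    y = [beta * xi + rng.gauss(0.0, sigma) for xi in d]
    estimates.append(sum(xi * yi for xi, yi in zip(d, y)) / sum(xi * xi for xi in d))

mean_b = statistics.fmean(estimates)
var_sim = statistics.pvariance(estimates)
print(mean_b, var_sim, var_formula, var_naive)
```

The simulated variance tracks the probability-weighted average of the conditional variances. In this scalar example that average differs from the reciprocal of the average of S, so the expectation here is taken of the inverse.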


It is clear that the same argument goes through when the X's are continuous. 
Unfortunately both elements in Eq. (7-26) are unknown. Intuition suggests 
replacing 

$$\sigma^2 \text{ by } s^2 = \frac{e'e}{n - k} \qquad \text{and} \qquad E(X'X)^{-1} \text{ by } (X'X)^{-1}$$

It can be shown that the result is an unbiased estimator of Eq. (7-26) for 

$$E\{s^2(X'X)^{-1}\} = E_X\{E_{s^2|X}[s^2(X'X)^{-1}]\} = E_X\{\sigma^2(X'X)^{-1}\} = \sigma^2 E(X'X)^{-1}$$

† See App. A-8, Expectations in Bivariate Distributions. 

To summarize the position so far, when the X's are stochastic but independent 
of the u's, the ML (= OLS) estimators for the fixed X case are still ML for 
the stochastic case, and thus all the conventional test procedures are still justified 
asymptotically. For finite samples the OLS (ML) estimator b (= $\hat{\beta}$) is unconditionally 
unbiased, and var(b) attains the Cramer-Rao MVB. Moreover the conventional 
estimator $s^2(X'X)^{-1}$ is an unbiased estimator of that MVB. 

The only remaining question concerns the finite sample validity of the 
conventional inference procedures. The basic result is that all confidence interval 
statements and significance levels derived from the usual formulas are still correct, 
but the probabilities of type II errors and the widths of confidence intervals will 
be different. Confidence intervals and hypothesis tests are derived by calculating 
probabilities from the sampling distribution of some appropriate test statistic 
under $H_0$. The test statistic is in general some function of the sample observations 
and may be denoted by t(y, X). If, for example, the test statistic is found to follow 
the t distribution, then we may make the probability statement 

$$\Pr\{-t_{\alpha/2} < t(y, X) < t_{\alpha/2}\} = 1 - \alpha \tag{7-27}$$

This statement may be used to derive a confidence interval or, equivalently, to 
determine a critical region for a test of $H_0$ at the $\alpha$ level of significance. 

The statement in Eq. (7-27), however, has been derived under the assumption 
of a fixed X matrix, and so it is a conditional probability statement. We need to 
find what unconditional probability statement can be made about t(·) when the 
stochastic nature of X is allowed for. Let A denote the event 

$$A: \; -t_{\alpha/2} < t(\cdot) < t_{\alpha/2}$$

and let us suppose that the pdf for X gives a finite number of X matrices, 
$X_1, X_2, \ldots, X_m$, with associated probabilities 

$$p_1, p_2, \ldots, p_m \qquad \text{with} \qquad \sum_{i=1}^{m} p_i = 1$$

A restatement of Eq. (7-27) then gives 

$$\Pr\{A|X_i\} = 1 - \alpha \qquad i = 1, \ldots, m$$

and the unconditional probability of A is 

$$\Pr(A) = \sum_{i=1}^{m} \Pr(A|X_i) \cdot p_i = (1 - \alpha)\sum_{i=1}^{m} p_i = 1 - \alpha \tag{7-28}$$
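
Equation (7-28) can be checked by simulation with a continuously distributed regressor. The sketch below rests on assumptions made purely for illustration: a two-variable model with n = 20, a normal regressor redrawn in every sample independently of normal disturbances, and the tabulated value $t_{0.025}(18) = 2.101$ hard-coded as the critical value. It estimates the unconditional coverage of the usual 95 percent confidence interval for the slope.

```python
import math
import random

T_CRIT = 2.101   # tabulated t_{0.025} for n - k = 18 degrees of freedom

def interval_covers(n, beta, rng):
    """True if the conventional 95 percent interval for the slope contains beta."""
    x = [rng.gauss(0.0, 2.0) for _ in range(n)]          # stochastic regressor
    y = [1.0 + beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return abs(b - beta) < T_CRIT * math.sqrt(s2 / sxx)

rng = random.Random(8)
reps = 4000
coverage = sum(interval_covers(20, 0.5, rng) for _ in range(reps)) / reps
print(f"unconditional coverage of the nominal 95 percent interval: {coverage:.3f}")
```

The interval widths vary from sample to sample with the realized X, yet the coverage frequency should stay near 0.95, in line with the distinction drawn above between confidence coefficients and interval widths.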

Thus a confidence interval computed in the usual way will have the same 
confidence coefficient for random X as for fixed X, and a hypothesis test will have 
the same significance level in each case. The assumption of a discrete distribution 
for X is only a simplification, and it is clear that the argument carries through for 
continuous distributions.† Notice that it has not been necessary to derive the 
sampling distributions of the estimators to establish the above result. These 
distributions will be more complicated than those already obtained in Chap. 5. As 
an illustration consider the ML estimator $\hat{\beta}$ of the slope coefficient $\beta$ in a 
two-variable model. Letting x denote the vector of sample observations on the 
explanatory variable, we know that the conditional distribution of $\hat{\beta}$, given x, is 

$$f(\hat{\beta}|x) = N\left(\beta, \frac{\sigma^2}{\sum x^2}\right)$$

When x is stochastic, the marginal distribution of $\hat{\beta}$ is 

$$f(\hat{\beta}) = \int f(\hat{\beta}|x) \cdot g(x) \, dx$$

where g(x) is the marginal distribution of x. Clearly, the marginal distribution 
depends on the distribution of x, but the crux of the matter is that, provided x 
and u are fully independent and the parameters of g(x) do not involve $\beta$ or $\sigma^2$, the 
precise form of g(x) does not affect the probability statements underlying 
confidence coefficients and significance levels. 


PROBLEMS 


7-1 Derive the mean and the variance of the ML estimator $\hat{\sigma}^2 = e'e/n$ of the disturbance variance for 
the regression model $y = X\beta + u$ with $u \sim N(0, \sigma^2 I)$. (Hint: If $w \sim \chi^2(r)$, then E(w) = r and 
var(w) = 2r.) 

7-2 Prove that the OLS estimator $s^2 = e'e/(n - k)$ and the ML estimator of $\sigma^2$ in the regression 
model of Problem 7-1 are both consistent. 

7-3 Suppose that we have n independent observations $y_1, y_2, \ldots, y_n$, say, incomes, drawn by simple 
random sampling from a Pareto distribution which has the following pdf: 

$$p(y|\alpha) = \frac{\alpha \, (10{,}000)^{\alpha}}{y^{\alpha+1}} \qquad y \ge 10{,}000; \; \alpha > 0$$

What is the ML estimate for $\alpha$? 
(University of Washington, 1979) 


7-4 Consider a regression equation 

$$y_t = \alpha + \beta x_t + \epsilon_t$$

where $x_t$ is nonstochastic and $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are independently and identically distributed. The 
distribution of $\epsilon_t$ is 
fq) =AgQetP se <00;A>2 


Suppose $\lambda$ is not known. Set up the likelihood function for $y_1, y_2, \ldots, y_n$ and describe a way to obtain 
ML estimates of $\alpha$, $\beta$, and $\lambda$. 

(University of Michigan, 1977) 

† A complete derivation of this and other results is given in F. A. Graybill, Theory and Application 
of the Linear Model, Duxbury Press, Mass., 1976, chap. 10. 

7-5 Consider the uniform density 

$$f(X) = 1/\alpha \qquad 0 < X < \alpha$$

What is the ML estimator for $\alpha$? Does it attain the Cramer-Rao lower bound? Compare the 
asymptotic efficiency of the ML estimator for $\alpha$ with the alternative estimator derived from using the 
sample mean, and prove the consistency of that estimator for $\alpha$. 


(University of Chicago, 1975) 


CHAPTER EIGHT 


GENERALIZED LEAST SQUARES 


A basic assumption underpinning the methods outlined in Chaps. 5 and 6 is that 

$$E(uu') = \sigma^2 I \tag{8-1}$$

This is described as the assumption of spherical disturbances. It involves the 
double assumption that the disturbance variance is constant at each observation 
point and that the disturbance covariances at all possible pairs of observation 
points are zero. We now seek to do four things, namely: 


1. To indicate some of the more important cases in which assumption (8-1) may 
not be fulfilled 

2. To determine the properties of OLS estimators if they are (perhaps 
inadvertently) applied, even though the underpinning assumption about the 
disturbances is not valid 

3. To develop tests of whether assumption (8-1) has broken down 

4. To develop appropriate estimation procedures for cases where assumption 
(8-1) does not apply 


8-1 SOURCES OF NONSPHERICAL DISTURBANCES 


If the sample observations relate to households or firms in a cross-section study, 
the assumption of a common disturbance variance at all observation points is 
often implausible. For example, if Y refers to family expenditure and X to family 
income, the variance about the Engel curve is likely to increase with the size of X. 


Similarly if Y denotes profits and X is some measure of firm size, the same
property is to be expected. The specification of the disturbance variance matrix
would then be

$$E(uu') = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix} \tag{8-2}$$


which is the standard case of heteroscedasticity. Formulation (8-2) still assumes 
that the disturbances are pairwise uncorrelated. 

Suppose, to take a different example, that an investigator is studying the 
relationship between wage change and the level of unemployment and that he 
measures wage movements in terms of four-quarter overlapping changes. That is, 
the annual rate of wage change in quarter $t$ is specified as

$$\frac{w_t - w_{t-4}}{w_{t-4}}$$

where $w_t$ is the level of the wage index in quarter $t$. The observed change in the
index, $w_t - w_{t-4}$, is the result of some groups securing a wage change in quarter
$t - 3$, some others in quarter $t - 2$, and so forth. If one assumes one fourth of the
labor force to secure a wage change in each quarter, the dependent variable is an
average of these separate changes, and the disturbance term in the macrowage
equation is similarly an average of the separate quarterly disturbances and so
might be specified as

$$u_t = \tfrac{1}{4}(\varepsilon_t + \varepsilon_{t-1} + \varepsilon_{t-2} + \varepsilon_{t-3}) \tag{8-3}$$

where the $\varepsilon$'s indicate the disturbances in the wage change equations for the
separate groups. Let us assume

$$E(\varepsilon) = 0 \quad \text{and} \quad E(\varepsilon\varepsilon') = \sigma_\varepsilon^2 I \tag{8-4}$$

It then follows from Eq. (8-3) that

$$E(u_t^2) = \tfrac{1}{4}\sigma_\varepsilon^2$$

which does not depend on $t$, so that the $\{u_t\}$ series is homoscedastic. Further

$$E(u_t u_{t-1}) = \tfrac{3}{16}\sigma_\varepsilon^2$$

$$E(u_t u_{t-2}) = \tfrac{1}{8}\sigma_\varepsilon^2$$

$$E(u_t u_{t-3}) = \tfrac{1}{16}\sigma_\varepsilon^2$$

and

$$E(u_t u_{t-s}) = 0 \qquad \text{for } s \ge 4$$


Thus the variance matrix following from Eqs. (8-3) and (8-4) is

$$E(uu') = \frac{\sigma_\varepsilon^2}{16} \begin{bmatrix} 4 & 3 & 2 & 1 & 0 & \cdots & 0 \\ 3 & 4 & 3 & 2 & 1 & \cdots & 0 \\ 2 & 3 & 4 & 3 & 2 & \cdots & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & \cdots & 1 & 2 & 3 & 4 \end{bmatrix} \tag{8-5}$$

This again is a departure from Eq. (8-1), but in contrast with Eq. (8-2) there is just
one unknown in Eq. (8-5), namely, $\sigma_\varepsilon^2$. This is an example of autocorrelated
disturbances. The autocorrelation arose from temporal aggregation over individual
disturbances, which were themselves uncorrelated over time.
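The autocovariances behind Eq. (8-5) can be checked mechanically. The following is a minimal sketch, assuming the equal weights of Eq. (8-3); the function names `aggregated_autocov` and `variance_matrix` are illustrative and not part of the original text. Exact fractions are used so that the values can be compared directly with those derived above.

```python
from fractions import Fraction

def aggregated_autocov(weights, sigma2=1):
    """Autocovariances of u_t = sum_j weights[j] * eps_{t-j}, eps white noise."""
    q = len(weights)
    return [sigma2 * sum(weights[j] * weights[j + s] for j in range(q - s))
            for s in range(q)]

def variance_matrix(n, weights, sigma2=1):
    """n x n matrix E(uu') implied by the moving-average weights."""
    gamma = aggregated_autocov(weights, sigma2)
    return [[gamma[abs(i - j)] if abs(i - j) < len(gamma) else 0
             for j in range(n)] for i in range(n)]

w = [Fraction(1, 4)] * 4        # equal quarterly weights, as in Eq. (8-3)
gamma = aggregated_autocov(w)   # lags 0..3, in units of sigma_eps^2
V = variance_matrix(6, w)       # banded matrix: zero beyond lag 3
```

Changing `w` to unequal weights would show how a different pattern of settlements alters the band structure while leaving only one unknown scalar.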

To continue the wage change model, it was customary in many early studies 
of the Phillips curve for researchers to employ four-quarter overlapping changes. 
These, however, have some unfortunate side effects. In consequence it is now 
more customary to specify a model of the form 

$$\frac{w_t - w_{t-1}}{w_{t-1}} = \alpha + x_t'\beta + u_t \tag{8-6}$$

where the dependent variable is now expressed as a one-quarter change in the
wage index, and $x_t'$ denotes a vector of explanatory variables, such as expected
price inflation, the reciprocal of the unemployment rate, and so forth.† With the
specification (8-6), however, it becomes less appropriate to specify zero correlations
for various disturbances. In particular, neighboring disturbances might be
positively correlated. Suppose, for instance, that an “unusually large” settlement
was secured by the workers settling in period $t - 1$, where unusually large means
“larger than would normally be associated with the vector $x_{t-1}$.” If this leads to a
greater than usual “push” by the workers negotiating in period $t$, then we might
specify

$$u_t = \rho u_{t-1} + \varepsilon_t \tag{8-7}$$


where $\rho$ is some parameter, and the $\{\varepsilon_t\}$ series has the usual simple properties
specified in Eq. (8-4). Equation (8-7) specifies a first-order autoregressive (Markov)
scheme: autoregressive since $u$ is related to lagged values of itself, and first-order
because the maximum lag in the autoregression is 1.

Let us introduce the lag operator $L$ such that, when applied to any variable $x_t$,

$$L x_t = x_{t-1}$$

$$L^2 x_t = L(L x_t) = L x_{t-1} = x_{t-2}$$

and, in general,

$$L^s x_t = x_{t-s} \qquad s \ge 1$$

Thus Eq. (8-7) may be rewritten as

$$(1 - \rho L)u_t = \varepsilon_t$$

or‡

$$u_t = \frac{1}{1 - \rho L}\,\varepsilon_t = (1 + \rho L + \rho^2 L^2 + \cdots)\varepsilon_t$$


† Note that we retain our convention of using $u$ to indicate the disturbance term in Eq. (8-6). It
does not indicate the unemployment rate.
‡ The lag operator may be treated as a scalar for purposes of algebraic manipulation. For any
constant $a$ with $|a| < 1$ we have $(1 - a)^{-1} = 1 + a + a^2 + \cdots$. Replacing $a$ by $\rho L$ gives the result
stated above.

that is,

$$u_t = \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots \tag{8-8}$$

Squaring both sides of Eq. (8-8) and taking expectations,

$$\operatorname{var}(u_t) = E(u_t^2) = \frac{\sigma_\varepsilon^2}{1 - \rho^2} \qquad \text{if } |\rho| < 1 \tag{8-9}$$

The right-hand side does not involve $t$, thus the $\{u_t\}$ series has a constant
variance, $\sigma_u^2 = \sigma_\varepsilon^2/(1 - \rho^2)$.

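Using the definition of $u_t$ in Eq. (8-8) and the properties of $\varepsilon_t$ assumed in Eq.
(8-4), it is simple to establish that

$$E(u_t u_{t-1}) = \rho\sigma_u^2$$

$$E(u_t u_{t-2}) = \rho^2\sigma_u^2$$

and, in general,

$$E(u_t u_{t-s}) = \rho^s\sigma_u^2 \tag{8-10}$$

Thus the variance matrix for a disturbance following a first-order autoregressive
scheme is

$$E(uu') = \sigma_u^2 \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{bmatrix} \tag{8-11}$$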


If $\rho$ were known, this expression, like Eq. (8-5), would involve only one unknown.
There are many other ways, as we shall see later, in which nonspherical
disturbances may arise, but these three examples illustrate some important
patterns. The general nonspherical disturbance matrix may be specified as

$$E(uu') = \sigma^2\Omega \tag{8-12a}$$

or

$$E(uu') = V \tag{8-12b}$$

The choice of specification depends on whether or not we wish to single out an
unknown scalar, which multiplies all the elements in the matrix as in Eqs. (8-5)
and (8-11). In either case, since we are dealing with a disturbance matrix, $\Omega$ and $V$
are assumed to be positive definite matrices.
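As a minimal sketch, assuming NumPy is available, the first-order autoregressive variance matrix of Eq. (8-11) can be built from the lag distances $|i - j|$ and checked for positive definiteness. The function name `ar1_variance_matrix` and the values $n = 5$, $\rho = 0.6$ are illustrative assumptions, not taken from the text.

```python
import numpy as np

def ar1_variance_matrix(n, rho, sigma_eps2=1.0):
    """E(uu') for u_t = rho * u_{t-1} + eps_t with |rho| < 1, as in Eq. (8-11)."""
    if abs(rho) >= 1:
        raise ValueError("a finite variance requires |rho| < 1")
    sigma_u2 = sigma_eps2 / (1.0 - rho ** 2)     # Eq. (8-9)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return sigma_u2 * rho ** lags                # element (i, j) is sigma_u^2 * rho^|i-j|

V = ar1_variance_matrix(5, 0.6)
eigenvalues = np.linalg.eigvalsh(V)   # all positive if V is positive definite
```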


8-2 PROPERTIES OF OLS ESTIMATORS UNDER 
NONSPHERICAL DISTURBANCES 


Our assumed model is now

$$y = X\beta + u$$

where $X$ is taken as a nonstochastic matrix with full column rank,

$$E(u) = 0 \quad \text{and} \quad E(uu') = V \ (\text{or } \sigma^2\Omega)$$

The OLS estimator of $\beta$ may be expressed as usual as

$$b = \beta + (X'X)^{-1}X'u$$

Thus

$$E(b) = \beta$$

so that OLS is still unbiased. The variance matrix is given by

$$\operatorname{var}(b) = E\{(b - \beta)(b - \beta)'\} = E\{(X'X)^{-1}X'uu'X(X'X)^{-1}\} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1} \tag{8-13}$$


Thus the conventional formula $\sigma^2(X'X)^{-1}$ no longer measures the sampling
variances of the OLS estimators, and any application of it is potentially misleading.
More importantly, even if one could use Eq. (8-13) to estimate the sampling
variances, the substitution of these numbers in the conventional $t$ formulas and
confidence interval formulas is strictly invalid since the assumptions used in
deriving those inference procedures no longer apply. For the same reason the
optimal minimum variance property of OLS no longer holds. We will illustrate
these points for various specific departures from spherical disturbances in later
sections and also discuss various specific tests for departures from spherical
disturbances. Now it is more important to turn to the development of a more
appropriate estimator.
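The gap between the conventional formula and Eq. (8-13) can be seen numerically. The sketch below assumes a hypothetical design (an intercept and a linear trend, 20 observations) and an AR(1)-type $\Omega$ with parameter 0.7; none of these values come from the text, and `ols_variance_matrices` is an illustrative name.

```python
import numpy as np

def ols_variance_matrices(X, omega, sigma2=1.0):
    """Return (conventional, correct) variance matrices of the OLS estimator."""
    XtX_inv = np.linalg.inv(X.T @ X)
    conventional = sigma2 * XtX_inv                          # valid only when omega = I
    correct = sigma2 * XtX_inv @ X.T @ omega @ X @ XtX_inv   # Eq. (8-13)
    return conventional, correct

# Hypothetical design: intercept plus trend, positively autocorrelated disturbances.
n, rho = 20, 0.7
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1)])
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
omega = rho ** lags

conventional, correct = ols_variance_matrices(X, omega)
same_conv, same_correct = ols_variance_matrices(X, np.eye(n))
```

With $\Omega = I$ the two matrices coincide; with the positively autocorrelated $\Omega$ and a trending regressor used here, the conventional formula understates the sampling variance of the slope.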


8-3 THE GENERALIZED LEAST-SQUARES ESTIMATOR 


Suppose we premultiply the assumed model

$$y = X\beta + u$$

by some $n \times n$ nonsingular transformation matrix $T$ to obtain

$$Ty = (TX)\beta + Tu \tag{8-14}$$

Each element in the vector $Ty$ is then some linear combination of the elements in
$y$, and so forth. The variance matrix for the disturbance in Eq. (8-14) is

$$E(Tuu'T') = \sigma^2 T\Omega T' \tag{8-15}$$

since $E(Tu) = 0$. If it were possible to specify $T$ such that

$$T\Omega T' = I \tag{8-16}$$


then we could apply OLS to the transformed variables Ty and TX in Eq. (8-14), 
and the resultant estimates would have all the optimal properties of OLS and 
could be validly subjected to the usual inference procedures. 

It is, in fact, possible to find a matrix $T$ which will satisfy Eq. (8-16), for it
was shown in Eq. (4-111) that if $\Omega$ is a symmetric positive definite matrix, a
nonsingular matrix $P$ can be found such that

$$\Omega = PP'$$

Since $P$ is nonsingular,

$$P^{-1}\Omega (P')^{-1} = I \tag{8-17}$$

Comparison of Eqs. (8-16) and (8-17) shows that the appropriate $T$ is given by

$$T = P^{-1}$$

and it easily follows that

$$\Omega^{-1} = (P')^{-1}P^{-1} = (P^{-1})'P^{-1} = T'T$$

Applying OLS to Eq. (8-14) then gives

$$b_* = (X'T'TX)^{-1}X'T'Ty = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y \tag{8-18}$$

with the variance-covariance matrix given by

$$\operatorname{var}(b_*) = \sigma^2(X'\Omega^{-1}X)^{-1} \tag{8-19}$$


The estimator $b_*$ is defined to be the generalized least-squares (GLS) or Aitken
estimator. Since Eqs. (8-15) and (8-16) imply that Eq. (8-14) satisfies the assumptions
required for the application of OLS, it follows that $b_*$ is a best linear
unbiased estimator of $\beta$ in the model $y = X\beta + u$ with $E(uu') = \sigma^2\Omega$.
Alternatively Eq. (8-15) may be written

$$E(Tuu'T') = TVT'$$

and setting $TVT' = I$ gives $T'T = V^{-1}$ so that the GLS estimator may also be
written as

$$b_* = (X'V^{-1}X)^{-1}X'V^{-1}y \tag{8-20}$$

with

$$\operatorname{var}(b_*) = (X'V^{-1}X)^{-1} \tag{8-21}$$

Comparing Eqs. (8-18) and (8-20) shows that it makes no difference to $b_*$ whether
var(u) is formulated as $\sigma^2\Omega$ or as $V$, but care has to be taken to select the correct
expression for var($b_*$), as a comparison of Eqs. (8-19) and (8-21) shows.†
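The equivalence between the direct formula (8-18) and OLS on the transformed data of Eq. (8-14) can be illustrated as follows. This is a sketch under stated assumptions: the Cholesky factor is used as one convenient choice of $P$ with $\Omega = PP'$ (the text only requires some nonsingular $P$), and the simulated data, seed, and function names are hypothetical.

```python
import numpy as np

def gls_direct(X, y, omega):
    """b* = (X' omega^{-1} X)^{-1} X' omega^{-1} y, Eq. (8-18)."""
    omega_inv = np.linalg.inv(omega)
    return np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)

def gls_by_transformation(X, y, omega):
    """OLS applied to Ty and TX with T = P^{-1}, where omega = P P'."""
    P = np.linalg.cholesky(omega)      # lower triangular, omega = P P'
    TX = np.linalg.solve(P, X)         # P^{-1} X without forming the inverse
    Ty = np.linalg.solve(P, y)
    return np.linalg.lstsq(TX, Ty, rcond=None)[0]

# Hypothetical data with AR(1)-type disturbances.
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
omega = 0.5 ** lags
y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(omega) @ rng.normal(size=n)

b_direct = gls_direct(X, y, omega)
b_transformed = gls_by_transformation(X, y, omega)
```

The two routes give the same estimate; the transformation route avoids inverting $\Omega$ explicitly, which is usually preferable numerically.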

If the further assumption of normality for the u’s is added, it may be shown 
that the GLS estimator is also an ML estimator. We specify 


$$u \sim N(0, \sigma^2\Omega)$$

Thus the likelihood function is

$$L = p(y \mid X) = \frac{1}{(2\pi)^{n/2}(\sigma^2)^{n/2}|\Omega|^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta)\right]$$

and the log likelihood is

$$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2}\ln|\Omega| - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta)$$

† Note that the transformation matrix $T$, defined in Eq. (8-16), differs by a scale factor from that
defined by $TVT' = I$.

Maximizing $\ln L$ with respect to $\beta$ implies minimizing the weighted sum of
squares

$$(y - X\beta)'\Omega^{-1}(y - X\beta) = y'\Omega^{-1}y - 2\beta'X'\Omega^{-1}y + \beta'X'\Omega^{-1}X\beta$$

Differentiating with respect to $\beta$ and equating to zero gives

$$b_* = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$$

as in Eq. (8-18).

An unbiased estimator of $\sigma^2$ may be derived from the application of OLS to
Eq. (8-14). It is

$$s^2 = \frac{(Ty - TXb_*)'(Ty - TXb_*)}{n - k} = \frac{(y - Xb_*)'T'T(y - Xb_*)}{n - k} = \frac{(y - Xb_*)'\Omega^{-1}(y - Xb_*)}{n - k} = \frac{y'\Omega^{-1}y - b_*'X'\Omega^{-1}y}{n - k} \tag{8-22}$$


On the assumption of normality for the disturbance term all the inference 
procedures of Chaps. 5 and 6 carry through for this model. Thus the test of 


$$H_0\colon R\beta = r$$

is based on

$$F = \frac{(r - Rb_*)'[R(X'\Omega^{-1}X)^{-1}R']^{-1}(r - Rb_*)/q}{s^2}$$

having the $F(q, n - k)$ distribution under the null hypothesis, where $b_*$ is the
GLS estimator defined in Eq. (8-18) and $s^2$ the variance estimator defined in Eq.
(8-22).

The above formulas are only operational if the elements of $\Omega$ are known. In
some exceptional cases this may be so, but in most practical cases it is not. We
must therefore proceed to the development of operational procedures for such
cases, but there is, in fact, no single procedure which is generally applicable. One
must look for the procedure which is best suited to the features of each specific
problem in turn, and that is done in the remaining sections of this chapter.


8-4 HETEROSCEDASTICITY 


We have already mentioned in Sec. 8-1 the possibility of heteroscedastic disturbances
in cross-section studies. Heteroscedasticity may also arise in dealing
with grouped data. Suppose the model is

$$Y_t = \alpha + \beta X_t + u_t \qquad t = 1, \ldots, n$$

where the $u_t$ are homoscedastic with zero covariances. However, suppose we only
have access to data which have been averaged within $m$ groups, where $n_i$ indicates
the number of observations in the $i$th group. The form of the model appropriate
to the data is now

$$\bar{Y}_i = \alpha + \beta\bar{X}_i + \bar{u}_i \qquad i = 1, \ldots, m$$

and clearly

$$\operatorname{var}(\bar{u}_i) = \frac{\sigma^2}{n_i} \qquad i = 1, \ldots, m$$

Thus

$$\sigma^2\Omega = \sigma^2 \begin{bmatrix} 1/n_1 & 0 & \cdots & 0 \\ 0 & 1/n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/n_m \end{bmatrix} \tag{8-23}$$

where $\Omega$ is known and the GLS estimator can easily be computed.


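Example 8-1 We have taken the same X, Y data as in Example 2-1, only now
it is assumed that they relate to group means. The $n_i$ column indicates the
number of observations in each group. The overall means are easily computed
from

$$\bar{X} = \frac{\sum n_i X_i}{\sum n_i} = \frac{202}{50} = 4.04$$

$$\bar{Y} = \frac{\sum n_i Y_i}{\sum n_i} = \frac{400}{50} = 8.00$$

which are almost identical with the simple means of 4 and 8 in Table 2-1. We
assume that Eq. (8-23) is the appropriate assumption about var(u), that is,

$$\Omega = \operatorname{diag}\left(\tfrac{1}{12}, \tfrac{1}{6}, \tfrac{1}{11}, \tfrac{1}{10}, \tfrac{1}{11}\right)$$

Thus

$$\Omega^{-1} = \operatorname{diag}(n_1, n_2, n_3, n_4, n_5) = \operatorname{diag}(12, 6, 11, 10, 11)$$

It may then be seen that

$$X'\Omega^{-1}X = \begin{bmatrix} 1 & \cdots & 1 \\ X_1 & \cdots & X_5 \end{bmatrix} \begin{bmatrix} n_1 & & \\ & \ddots & \\ & & n_5 \end{bmatrix} \begin{bmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_5 \end{bmatrix} = \begin{bmatrix} \sum n_i & \sum n_i X_i \\ \sum n_i X_i & \sum n_i X_i^2 \end{bmatrix}$$

and

$$X'\Omega^{-1}y = \begin{bmatrix} \sum n_i Y_i \\ \sum n_i X_i Y_i \end{bmatrix}$$

Formula (8-18) for the GLS estimator now simplifies to

$$\begin{bmatrix} \sum n_i & \sum n_i X_i \\ \sum n_i X_i & \sum n_i X_i^2 \end{bmatrix} \begin{bmatrix} b_{1*} \\ b_{2*} \end{bmatrix} = \begin{bmatrix} \sum n_i Y_i \\ \sum n_i X_i Y_i \end{bmatrix}$$

which is a form of weighted least squares. Applying the data from Table 8-1
gives

$$50 b_{1*} + 202 b_{2*} = 400$$

$$202 b_{1*} + 1254 b_{2*} = 2388$$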


with solution $b_{1*} = 0.88$ and $b_{2*} = 1.76$. To obtain the sampling variance of
these estimates, substitute for $\Omega^{-1}$ from Eq. (8-23) in Eq. (8-22) to obtain for
this example

$$(n - k)s^2 = \sum n_i Y_i^2 - \begin{bmatrix} b_{1*} & b_{2*} \end{bmatrix} \begin{bmatrix} \sum n_i Y_i \\ \sum n_i X_i Y_i \end{bmatrix} = 4574 - \begin{bmatrix} 0.8791 & 1.7626 \end{bmatrix} \begin{bmatrix} 400 \\ 2388 \end{bmatrix} = 13.2712$$

Thus

$$s^2 = \frac{13.2712}{3} = 4.4237$$

Table 8-1

  X_i   Y_i   n_i   n_i X_i   n_i Y_i   n_i X_i^2   n_i X_i Y_i   n_i Y_i^2
    2     4    12        24        48          48            96         192
    3     7     6        18        42          54           126         294
    1     3    11        11        33          11            33          99
    5     9    10        50        90         250           450         810
    9    17    11        99       187         891          1683        3179
 Sums          50       202       400        1254          2388        4574


Notice that the $n$ which occurs in the denominator of the variance formula,
Eq. (8-22), is the number of sample points. It is not the total number of
observations underlying the sample points. In this example, the latter number
is $\sum n_i = 50$, but $n = 5$. Finally, substitution in Eq. (8-19) gives

$$\operatorname{var}(b_*) = s^2(X'\Omega^{-1}X)^{-1} = 4.4237 \begin{bmatrix} 50 & 202 \\ 202 & 1254 \end{bmatrix}^{-1} = 4.4237 \begin{bmatrix} 0.057271 & -0.009225 \\ -0.009225 & 0.002284 \end{bmatrix} = \begin{bmatrix} 0.2533 & -0.0408 \\ -0.0408 & 0.0101 \end{bmatrix}$$

Thus

$$\operatorname{var}(b_{1*}) = 0.2533$$

$$\operatorname{var}(b_{2*}) = 0.0101$$


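This example might have been treated equivalently by finding the $T$
matrix satisfying $T'T = \Omega^{-1}$. Given $\Omega^{-1}$, the $T$ matrix is simply

$$T = \operatorname{diag}(\sqrt{12}, \sqrt{6}, \sqrt{11}, \sqrt{10}, \sqrt{11})$$

Thus the data of Table 8-1 could have been recorded as

  X:   2√12   3√6   √11    5√10   9√11
  Y:   4√12   7√6   3√11   9√10   17√11

and OLS applied to these five pairs of numbers, with the column of units for the
intercept likewise replaced by $\sqrt{n_i}$.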
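The weighted least-squares arithmetic of Example 8-1 can be reproduced in a few lines. The sketch below uses only the data of Table 8-1 and solves the two normal equations by Cramer's rule; the variable names are illustrative. Because it carries full precision throughout, its later decimals may differ slightly from the printed figures, which rest on rounded intermediate values.

```python
X = [2, 3, 1, 5, 9]
Y = [4, 7, 3, 9, 17]
n_i = [12, 6, 11, 10, 11]          # group sizes, the diagonal of Omega^{-1}

S_n = sum(n_i)
S_nx = sum(n * x for n, x in zip(n_i, X))
S_ny = sum(n * y for n, y in zip(n_i, Y))
S_nxx = sum(n * x * x for n, x in zip(n_i, X))
S_nxy = sum(n * x * y for n, x, y in zip(n_i, X, Y))
S_nyy = sum(n * y * y for n, y in zip(n_i, Y))

# Normal equations: [[S_n, S_nx], [S_nx, S_nxx]] [b1, b2]' = [S_ny, S_nxy]'
det = S_n * S_nxx - S_nx ** 2
b1 = (S_ny * S_nxx - S_nx * S_nxy) / det    # intercept
b2 = (S_n * S_nxy - S_nx * S_ny) / det      # slope

rss = S_nyy - b1 * S_ny - b2 * S_nxy        # (n - k) s^2, from Eq. (8-22)
s2 = rss / (len(X) - 2)                     # n = 5 sample points, k = 2
var_b1 = s2 * S_nxx / det                   # diagonal of s^2 (X' Omega^{-1} X)^{-1}
var_b2 = s2 * S_n / det
```

Note that the divisor uses the five sample points, not the fifty underlying observations.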
A different variant of a cross-section study is one with replication of the Y
variable for given values of X. Suppose, for instance, that agronomists are
investigating the variation of crop yield in response to varying applications of
fertilizer. Let $X_1, \ldots, X_i, \ldots, X_m$ denote the different fertilizer dosages chosen for
the experiment. For dosage $X_i$, $n_i$ plots are chosen, and $Y_{ij}$ ($j = 1, \ldots, n_i$)
denotes the resultant set of $n_i$ yields. A model for the linear effect of fertilizer on
yield would then be specified as

$$Y_{ij} = \alpha + \beta X_i + u_{ij} \qquad i = 1, \ldots, m; \quad j = 1, \ldots, n_i \tag{8-24}$$


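Denoting the vector of disturbances in the $i$th application by $u_i$, we make the
conventional assumptions

$$E(u_i) = 0 \quad \text{and} \quad E(u_i u_i') = \sigma_i^2 I_{n_i} \qquad i = 1, \ldots, m \tag{8-25}$$

Thus Eq. (8-25) allows the disturbance variance to be different in different
applications, but assumes homoscedasticity and zero covariances within applications.
However, an additional assumption is now required to cover the relations
between disturbances in different applications. We assume these to be uncorrelated,
that is,

$$E(u_i u_r') = 0 \qquad i, r = 1, \ldots, m; \quad i \ne r \tag{8-26}$$

The complete model may now be written

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_m \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \tag{8-27}$$

where

$$\mathbf{X}_i = \begin{bmatrix} 1 & X_i \\ 1 & X_i \\ \vdots & \vdots \\ 1 & X_i \end{bmatrix} \qquad i = 1, \ldots, m$$

A more compact form of Eq. (8-27) is

$$y = X\beta + u$$

where $y' = [y_1' \;\; y_2' \;\; \cdots \;\; y_m']$, and so on. Assumptions (8-25) and (8-26) produce
a block-diagonal form for var(u), namely,

$$\operatorname{var}(u) = \begin{bmatrix} \sigma_1^2 I_{n_1} & 0 & \cdots & 0 \\ 0 & \sigma_2^2 I_{n_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_m^2 I_{n_m} \end{bmatrix} \tag{8-28}$$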

Notice that each $\mathbf{X}_i$ submatrix has only unit rank, since the same dosage is
applied to all plots within the group, but the X matrix has full column rank.
Model (8-27) is a special case of a more general model, which may be written

$$\begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_m \end{bmatrix} \beta + \begin{bmatrix} u_1 \\ \vdots \\ u_m \end{bmatrix} \tag{8-29}$$

where each $\mathbf{X}_i$ is of order $n_i \times k$, the rows of $\mathbf{X}_i$ are not required to be identical,
and the assumption is still made that var(u) has the block-diagonal structure given
in Eq. (8-28). For example, $Y_{ij}$ might represent investment expenditure by firm $i$
in year $j$ and the X's the variables thought to influence investment expenditures.
The study would thus cover $m$ different firms in various years, that is, a pooling of
time-series and cross-section data. Model (8-29) assumes a common set of
reaction coefficients $\beta$ for all firms, but assumption (8-28) would allow the
disturbance variances to differ across firms.†


Testing for Heteroscedasticity 


Test for the equality of variances. In the case of replicated data, model (8-27), and
$u_i \sim N(0, \sigma_i^2 I_{n_i})$, a standard test for the equality of variances is available.‡ The
hypothesis of homoscedasticity is

$$H_0\colon \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2$$

The test is conducted as follows.

1. For each class or group, compute the within-group sample variance

$$s_i^2 = \frac{1}{\nu_i} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 \qquad i = 1, \ldots, m$$

where $\nu_i = n_i - 1$, and $\bar{Y}_i$ is the mean of the $n_i$ observations in the $i$th group.

2. Compute the pooled variance

$$s^2 = \frac{1}{\nu} \sum_{i=1}^{m} \nu_i s_i^2$$

where

$$\nu = \sum_{i=1}^{m} (n_i - 1)$$

and the quantity

$$Q' = \nu \ln s^2 - \sum_{i=1}^{m} \nu_i \ln s_i^2$$

Under the null hypothesis $Q'$ will be approximately distributed as $\chi^2(m - 1)$.
However, the approximation will be improved by dividing $Q'$ by the scaling
constant

$$C = 1 + \frac{1}{3(m - 1)} \left[ \sum_{i=1}^{m} \frac{1}{\nu_i} - \frac{1}{\nu} \right]$$

to give

$$Q = \frac{Q'}{C}$$

If $Q > \chi^2_{0.95}(m - 1)$, say, then the null hypothesis would be rejected at the 5
percent level of significance.

† The zero covariances incorporated in Eq. (8-28) may be an oversimplification for this model. See
Sec. 8-6, Sets of Equations, and Sec. 10-3, Pooling of Time-Series and Cross-Section Data.

‡ See M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 2, Griffin, London,
1961, pp. 234-236.
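The steps of the test for equality of variances can be sketched as a short function. This is an illustration only, assuming replicated data supplied as one list of Y values per group; the function name `bartlett_statistic` and the numbers in `hypothetical_groups` are invented for the example and are not the data of the text.

```python
import math

def bartlett_statistic(groups):
    """Return (Q, degrees of freedom) for H0: equal variances across groups."""
    m = len(groups)
    nu = [len(g) - 1 for g in groups]                    # nu_i = n_i - 1
    means = [sum(g) / len(g) for g in groups]
    s2 = [sum((y - mu) ** 2 for y in g) / v              # within-group variances
          for g, mu, v in zip(groups, means, nu)]
    nu_total = sum(nu)
    pooled = sum(v * s for v, s in zip(nu, s2)) / nu_total
    q_prime = (nu_total * math.log(pooled)
               - sum(v * math.log(s) for v, s in zip(nu, s2)))
    c = 1 + (sum(1 / v for v in nu) - 1 / nu_total) / (3 * (m - 1))
    return q_prime / c, m - 1

# Hypothetical replicated yields at three dosages.
hypothetical_groups = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 6.0, 8.0]]
Q, df = bartlett_statistic(hypothetical_groups)
```

The returned `Q` would be compared with the chosen critical value of $\chi^2$ with `df` degrees of freedom. When all within-group variances are equal the statistic is zero.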


Example 8-2 The data of Table 6-1 do not fit this test exactly since the X 
variable is not constant within each class. However, we will ignore this 
discrepancy and use the Y data of Tables 6-1 and 6-2 to illustrate this test. 
We have $\nu_i = 4$ for each $i$ and

$$\nu = \sum_{i=1}^{m} \nu_i = 16 \qquad m = 4$$
Thus 
c=14 54 (1- 7g) = 14100 
EE ena ee 
pees, s2=235 sz=515 sz=45 
Lag 2 Bs 3 3 4 x 


Ins? = 3.0910 Ins} = 3.1570 Ins = 3.9416 In sf = 1.5041 
Lyin 5? = 46.4868 


vIn s* = 51.7402 
Thus 


_ 51.7402 — 25.3750 


clam Tamm ie 


which exceeds

$$\chi^2_{0.95}(3) = 7.81$$

and the hypothesis of homoscedasticity would be rejected on these numbers.
Let us repeat, however, that these calculations were for illustrative purposes
only since the within-group variability of X violates a basic assumption
underlying the test. A more appropriate procedure would be to replace the
within-group Y variances by the residual variances around the within-group
regressions of Y on X, that is, compute

$$s_i^2 = \frac{(y_i - \mathbf{X}_i b_i)'(y_i - \mathbf{X}_i b_i)}{n_i - k} \qquad i = 1, \ldots, m$$

where

$$b_i = (\mathbf{X}_i'\mathbf{X}_i)^{-1}\mathbf{X}_i'y_i$$
The Breusch-Pagan test.† Situations with replicated observations are rare in
practice. The more common situation is one of single observation points. The
Breusch-Pagan test for heteroscedasticity is a very general test in that it covers a
wide range of heteroscedastic situations, and it is also a very simple test in that it
is based on the OLS residuals. The model is

$$y = X\beta + u$$

where the disturbances $u_i$ are assumed to be normally and independently distributed
with variance

$$\sigma_i^2 = h(z_i'\alpha) \tag{8-30}$$

where $h(\cdot)$ denotes some unspecified functional form, $\alpha$ is a $p \times 1$ vector of
coefficients unrelated to $\beta$, and $z_i$ is a $p \times 1$ vector of variables thought to
influence the heteroscedasticity. The first element in $z_i$ is taken to be unity. Thus
the null hypothesis

$$H_0\colon \alpha_2 = \alpha_3 = \cdots = \alpha_p = 0$$

specifies homoscedasticity since then $\sigma_i^2 = h(\alpha_1)$, which is constant over all $i$. The
other Z variables may consist partially, or even exclusively, of X variables, that is,
the heteroscedasticity may be governed by the explanatory variables in the
structural relationship. The test is a large sample or asymptotic test and is
conducted as follows.


1. Fit the OLS regression of y on X and obtain the vector e of OLS residuals.

2. From e compute

$$\tilde{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n}$$

and the series

$$g_i = \frac{e_i^2}{\tilde{\sigma}^2} \qquad i = 1, \ldots, n$$

3. Specify the variables in the vector $z_i$. Notice that the functional form $h(\cdot)$ in
Eq. (8-30) does not have to be specified, merely the variables in the linear
combination $z_i'\alpha$. Then fit the regression of $g_i$ on $z_i$ and compute the
explained sum of squares (ESS) from the regression.

4. The quantity $Q = \mathrm{ESS}/2$ is, under the null hypothesis, asymptotically distributed
as $\chi^2(p - 1)$. Thus if $Q > \chi^2_{0.95}(p - 1)$ one would reject the hypothesis
of homoscedasticity at the 5 percent level.
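The four steps above can be sketched in code as follows, assuming NumPy. The function name `breusch_pagan`, the simulated data, and the seed are hypothetical; the $z$ variables are taken here to be the constant and the single regressor, which is only one of the choices the text allows.

```python
import numpy as np

def breusch_pagan(y, X, Z):
    """Return (Q, df). Z holds the z variables, with a column of ones first."""
    n = len(y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b                                  # step 1: OLS residuals
    sigma2_tilde = e @ e / n                       # step 2: sum of e_i^2 over n
    g = e ** 2 / sigma2_tilde
    d = np.linalg.lstsq(Z, g, rcond=None)[0]       # step 3: regress g on z
    ess = np.sum((Z @ d - g.mean()) ** 2)          # explained sum of squares
    return ess / 2.0, Z.shape[1] - 1               # step 4: Q and p - 1

# Hypothetical data whose disturbance standard deviation grows with x.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + x * rng.normal(size=n)

Q, df = breusch_pagan(y, X, X)     # z_i = (1, x_i), so p - 1 = 1
```

The statistic `Q` would then be compared with the 5 percent critical value of $\chi^2$ with `df` degrees of freedom.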


† T. S. Breusch and A. R. Pagan, “A Simple Test for Heteroscedasticity and Random Coefficient
Variation,” Econometrica, vol. 47, 1979, pp. 1287-1294.

The Goldfeld-Quandt test. An especially simple and finite sample test, which is
applicable if it is thought that one of the X variables is the basic explanation of
heteroscedasticity, is the Goldfeld-Quandt test.† Suppose it is suspected that $\sigma_i^2$ is
positively related to one of the X variables, say $X_j$. The test procedure is then as
follows.

1. Reorder the observations by the values of $X_j$.

2. Omit $c$ central observations.

3. Fit separate regressions by OLS to the first and last $(n - c)/2$ observations,
provided, of course, that $(n - c)/2$ exceeds the number of parameters to be
estimated.

4. Let $\mathrm{RSS}_1$ and $\mathrm{RSS}_2$ denote the residual sum of squares from the two
regressions, with the subscript 1 denoting that from the smaller $X_j$ values and
2 that from the larger $X_j$ values. Then

$$R = \frac{\mathrm{RSS}_2}{\mathrm{RSS}_1}$$

will, on the assumption of homoscedasticity, have the F distribution with
$((n - c - 2k)/2, (n - c - 2k)/2)$ degrees of freedom. Under the alternative
hypothesis $R$ will tend to be large. Thus if $R > F_{0.95}$, one would reject the
assumption of homoscedasticity at the 5 percent level.

The power of the test will depend, among other things, on the number of
central observations excluded, and will clearly be small if $c$ is too large (so that
$\mathrm{RSS}_1$ and $\mathrm{RSS}_2$ have very few degrees of freedom) or too small (so any possible
contrast between $\mathrm{RSS}_1$ and $\mathrm{RSS}_2$ is reduced). A rough guide is to set $c$ at
approximately $n/3$.‡
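As an illustration of the procedure, the following sketch computes $R$ and its degrees of freedom, assuming NumPy. The function name `goldfeld_quandt` is illustrative, and the data are a constructed example (a disturbance that alternates in sign with magnitude proportional to $x$), chosen so that the outcome does not depend on random draws.

```python
import numpy as np

def goldfeld_quandt(y, X, sort_col, c):
    """Return (R, (df1, df2)) after ordering by column sort_col and dropping c rows."""
    order = np.argsort(X[:, sort_col], kind="stable")
    y, X = y[order], X[order]
    n, k = X.shape
    m = (n - c) // 2                       # observations in each outer group
    if m <= k:
        raise ValueError("each subsample needs more observations than parameters")

    def rss(ys, Xs):
        b = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        r = ys - Xs @ b
        return r @ r

    R = rss(y[-m:], X[-m:]) / rss(y[:m], X[:m])    # RSS_2 / RSS_1
    return R, (m - k, m - k)

# Constructed example: n = 30, c = 10 (roughly n/3), disturbance proportional to x.
x = np.arange(1.0, 31.0)
signs = (-1.0) ** np.arange(30)
X = np.column_stack([np.ones(30), x])
y = 1.0 + 2.0 * x + x * signs
R, dof = goldfeld_quandt(y, X, sort_col=1, c=10)
```

Here each outer group has 10 observations and 2 parameters, so the reference distribution is $F(8, 8)$, matching $(n - c - 2k)/2 = 8$.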


The Glesjer test. None of the previous tests yields any specific estimate of the
form of heteroscedasticity which could then be inserted in var(u) to help derive
the GLS estimator. A test which helps in this direction is that due to Glesjer.§ It
is suggested, however, only for the case where a single variable Z is presumed to
determine the heteroscedasticity. The Z variable may, of course, be one of the
explanatory X variables in the structural relation. The test proceeds as follows.

1. Fit the OLS regression of y on X and derive the residual vector e.

2. Regress the absolute value of the OLS residuals¶ on $Z^h$, that is,

$$|e_i| = \delta_0 + \delta_1 Z_i^h + \text{error} \tag{8-31}$$

As it stands, this relation is nonlinear in $\delta_0$, $\delta_1$, and $h$. Glesjer suggests trying
regressions for a few specific values of $h$, such as $1$, $-1$, $\tfrac{1}{2}$. The estimated slope
coefficient $\hat{\delta}_1$ is then used to test the hypothesis that $\delta_1$ is zero, although the
conditions required for the validity of the usual significance test will not, in fact,
be satisfied by this regression. Acceptance of $H_0\colon \delta_1 = 0$ implies homoscedasticity
and its rejection, heteroscedasticity.

† S. M. Goldfeld and R. E. Quandt, “Some Tests for Homoscedasticity,” Journal of the American
Statistical Association, vol. 60, 1965, pp. 539-547; or S. M. Goldfeld and R. E. Quandt, Nonlinear
Methods in Econometrics, North-Holland, Amsterdam, 1972, Chap. 3, for a more general discussion.

‡ See A. C. Harvey and G. D. A. Phillips, “A Comparison of the Power of Some Tests for
Heteroscedasticity in the General Linear Model,” Journal of Econometrics, vol. 2, 1973, p. 312.

§ H. Glesjer, “A New Test for Heteroscedasticity,” Journal of the American Statistical Association,
vol. 64, 1969, pp. 316-323.

¶ Alternatively one might use $e_i^2$ as the dependent variable.


Estimation under Heteroscedasticity 


1. Example 8-1 has illustrated a simple case of GLS estimation for grouped
data, where the $\Omega$ matrix was known.

2. Another simple case occurs where one of the explanatory variables determines
the heteroscedasticity. This may have been determined by a Glesjer
type regression or postulated on a priori grounds. Suppose the heteroscedasticity
is modeled by

$$\sigma_i^2 = \sigma^2 X_{ji}^2 \qquad i = 1, \ldots, n \tag{8-32}$$

where $X_j$ is the explanatory variable thought to be the source of the heteroscedasticity.
The variance matrix of the disturbance term then takes the form

$$\operatorname{var}(u) = \sigma^2 \begin{bmatrix} X_{j1}^2 & 0 & \cdots & 0 \\ 0 & X_{j2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_{jn}^2 \end{bmatrix}$$

and the appropriate transformation matrix is

$$T = \begin{bmatrix} 1/X_{j1} & 0 & \cdots & 0 \\ 0 & 1/X_{j2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/X_{jn} \end{bmatrix}$$

Thus the original relation

$$Y_i = \beta_1 + \beta_2 X_{2i} + \cdots + \beta_j X_{ji} + \cdots + \beta_k X_{ki} + u_i$$

would be transformed for estimation purposes to

$$\left(\frac{Y_i}{X_{ji}}\right) = \beta_1\left(\frac{1}{X_{ji}}\right) + \beta_2\left(\frac{X_{2i}}{X_{ji}}\right) + \cdots + \beta_j + \cdots + \beta_k\left(\frac{X_{ki}}{X_{ji}}\right) + \left(\frac{u_i}{X_{ji}}\right) \tag{8-33}$$

and the inference procedures of Chap. 5 could then be validly applied to the
transformed variables in Eq. (8-33). Notice, however, that $\beta_j$ is estimated by
the intercept in the transformed relation, and the original intercept $\beta_1$ is
estimated by the coefficient of $1/X_j$. Equivalently, the diagonal matrix

$$\Omega^{-1} = \operatorname{diag}\left(\frac{1}{X_{j1}^2}, \frac{1}{X_{j2}^2}, \ldots, \frac{1}{X_{jn}^2}\right)$$

may be inserted along with the original y, X data in Eqs. (8-18) and (8-19).

† See the discussion in S. M. Goldfeld and R. E. Quandt, Nonlinear Methods in Econometrics,
North-Holland, Amsterdam, 1972, pp. 92-94.

3. In cases 1 and 2 we have assumed the elements in the $\Omega$ matrix to be known
exactly. In many realistic cases these elements have to be estimated and the
estimates then substituted in the GLS formulas. This is sometimes referred to
as a two-stage Aitken estimator (2SAE), or as a feasible GLS (Aitken)
estimator. For example, in the case of replicated data the within-group
sample Y variances could be estimated and substituted in $\Omega$. Or if a
Glesjer-type assumption postulated

$$\sigma_i^2 = \delta_0 + \delta_1 X_i$$

and these parameters were estimated from

$$e_i^2 = \delta_0 + \delta_1 X_i + \text{error}$$

the estimated disturbance variance matrix would be

$$\widehat{\operatorname{var}}(u) = \operatorname{diag}(\hat{\delta}_0 + \hat{\delta}_1 X_1, \; \hat{\delta}_0 + \hat{\delta}_1 X_2, \; \ldots, \; \hat{\delta}_0 + \hat{\delta}_1 X_n)$$
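The two-stage idea in case 3 can be sketched as follows, assuming NumPy and a variance that is linear in the chosen variables. The function name `feasible_gls` and the small data set are constructed for illustration: the disturbances alternate in sign within pairs of equal $x$, so they are exactly orthogonal to the regressors and the OLS residuals reproduce them. The positivity check is an added safeguard, since a fitted linear variance function need not be positive in general.

```python
import numpy as np

def feasible_gls(y, X, Z):
    """Two-stage Aitken sketch: estimate var(u_i) = z_i' delta, then reweight."""
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    e2 = (y - X @ b_ols) ** 2                       # squared OLS residuals
    delta = np.linalg.lstsq(Z, e2, rcond=None)[0]   # e_i^2 = z_i' delta + error
    var_hat = Z @ delta                             # diagonal of estimated var(u)
    if np.any(var_hat <= 0):
        raise ValueError("estimated variances must be positive")
    w = 1.0 / np.sqrt(var_hat)                      # diagonal of the T matrix
    b_fgls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
    return b_fgls, var_hat

# Constructed data: disturbance variance proportional to x.
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
signs = np.array([1.0, -1.0] * 4)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + signs * np.sqrt(x)

b_fgls, var_hat = feasible_gls(y, X, X)   # z_i = (1, x_i)
```

In this constructed case the fitted variances equal $x_i$ and the estimate recovers the coefficients (1, 2); with real data the first-stage variances are only estimates, which is why only asymptotic justification is available.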


An unfortunate consequence of replacing unknown disturbance parameters
by estimated values is that we no longer have exact finite sample results for the
resultant estimators. In general the small sample properties of feasible GLS
estimators are unknown. There is, however, some evidence from Monte Carlo
studies on the relative performance of feasible GLS estimators compared with
OLS.† Furthermore, under fairly general conditions the feasible GLS estimators
will have the same asymptotic distribution as the GLS estimators, so that the
conventional tests may be given an asymptotic justification.‡

The importance of adjusting for heteroscedasticity depends on the extent of
the departure from homoscedasticity. No general pronouncements can be made,
but as a very simple illustration consider a two-variable model

$$Y_i = \alpha + \beta X_i + u_i$$

where $X_i$ takes on the values 1, 2, 3, 4, 5. Let $b$ denote the OLS estimator of $\beta$ and
$b_*$ the GLS estimator and let us assume further that the nature of the heteroscedasticity
is

$$\sigma_i^2 = \sigma^2 X_i^2$$

† For a summary and detailed references see G. G. Judge, W. E. Griffiths, R. C. Hill, and T. C.
Lee, The Theory and Practice of Econometrics, Wiley, New York, 1980, Chap. 4.

‡ For the condition under which there is asymptotic equivalence see P. Schmidt, Econometrics,
Marcel Dekker, New York, 1976, Chap. 2, or H. Theil, Principles of Econometrics, Wiley, New York,
1971, Chap. 8. Unfortunately these conditions need to be checked out for each specific application.

A straightforward application of Eqs. (8-13) and (8-19) then gives

$$\frac{\operatorname{var}(b_*)}{\operatorname{var}(b)} = \frac{0.69}{1.24} = 0.56 \tag{8-34a}$$

and if the form of the heteroscedasticity is

$$\sigma_i^2 = \sigma^2 X_i$$

the corresponding result is

$$\frac{\operatorname{var}(b_*)}{\operatorname{var}(b)} = 0.83 \tag{8-34b}$$

Thus in this illustration, the efficiency of the OLS estimator ranges from 56 to 83
percent of the GLS estimator, depending on the postulated range for the
heteroscedasticity. Finally, we may note that in the heteroscedastic case, and in
other cases where GLS estimation is appropriate, there is no unique measure of
goodness of fit. A measure may be based on weighted sums of squares, using $\Omega^{-1}$
(or $V^{-1}$) as a weighting matrix, or on sums of squares of the transformed vector
$Ty$, although the latter is inappropriate if the transformed relation does not have
an intercept term. For details of these and other measures the reader should
consult the article by Buse.†
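The efficiency figures in the two-variable illustration (X taking the values 1 to 5) can be recomputed directly from Eqs. (8-13) and (8-19). The sketch below works in units of $\sigma^2$ and uses scalar formulas for the slope; the function name `slope_variances` is illustrative.

```python
def slope_variances(X, variance_of):
    """Return (var of OLS slope, var of GLS slope) in units of sigma^2."""
    n = len(X)
    xbar = sum(X) / n
    sxx = sum((x - xbar) ** 2 for x in X)
    omega = [variance_of(x) for x in X]              # diagonal of Omega
    # Slope element of (X'X)^{-1} X' Omega X (X'X)^{-1}, Eq. (8-13)
    var_ols = sum((x - xbar) ** 2 * o for x, o in zip(X, omega)) / sxx ** 2
    # Slope element of (X' Omega^{-1} X)^{-1}, Eq. (8-19)
    w = [1.0 / o for o in omega]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, X))
    swxx = sum(wi * x * x for wi, x in zip(w, X))
    var_gls = sw / (sw * swxx - swx ** 2)
    return var_ols, var_gls

X = [1, 2, 3, 4, 5]
ols_sq, gls_sq = slope_variances(X, lambda x: x ** 2)   # variance proportional to X^2
ols_lin, gls_lin = slope_variances(X, lambda x: x)      # variance proportional to X
ratio_sq = gls_sq / ols_sq
ratio_lin = gls_lin / ols_lin
```

For the variance proportional to $X^2$ this gives an OLS variance of 1.24 and a ratio of about 0.56, as in Eq. (8-34a); for the variance proportional to $X$ the ratio comes out at roughly 0.82, close to the figure quoted in Eq. (8-34b).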


8-5 AUTOCORRELATION 


Definitions 


The autocorrelation, which is the focus of this section, is that of the {u,} series. 
There may or may not be autocorrelation in the explanatory variables, but for the 
moment we are only concerned with possible autocorrelation in the disturbance 
term. When present, it results in some or all of the off-diagonal terms in the var(u) 
matrix being nonzero. This in turn destroys the optimal properties of OLS and 
gives rise to another application of GLS. 

We assume, as usual, zero mean for the series, that is,

$$E(u_t) = 0 \qquad \text{for all } t$$

The autocovariance at lag $s$ is defined by

$$\gamma_s = E(u_t u_{t-s}) \qquad s = 0, \pm 1, \pm 2, \ldots \tag{8-35}$$

At zero lag we have simply the constant variance of the series

$$\gamma_0 = E(u_t^2) = \sigma_u^2$$


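The autocorrelation coefficient at lag $s$ is defined by

$$\rho_s = \frac{\gamma_s}{\gamma_0} \qquad s = \pm 1, \pm 2, \ldots \tag{8-36}$$

† A. Buse, “Goodness of Fit in Generalized Least Squares Estimation,” The American Statistician,
vol. 27, 1973, pp. 106-108.

We note that the $\gamma$'s and $\rho$'s are symmetrical in $s$ and have been assumed to be
independent of the $t$ subscript, that is, these coefficients are constant over time
and depend only on the length of lag $s$. The variance matrix for the disturbance
term may then be written as

$$\operatorname{var}(u) = \begin{bmatrix} \gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{n-1} \\ \gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_{n-1} & \gamma_{n-2} & \gamma_{n-3} & \cdots & \gamma_0 \end{bmatrix} = \sigma_u^2 \begin{bmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{n-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{n-1} & \rho_{n-2} & \rho_{n-3} & \cdots & 1 \end{bmatrix} \tag{8-37}$$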


The above exposition has implicitly assumed a temporal, or time-series,
framework, but the same phenomenon may arise with cross-section data, where it
is often referred to as spatial autocorrelation. Suppose a sample of six observational
units is represented by Fig. 8-1. The units might actually be contiguous as
in the case of adjoining states, or “nearness” might be defined in terms of some
other variable. For instance, if the sample units were households, the first
household might have an income close to the incomes of the second and third
households but not close to the incomes of the remaining households. If one
hypothesized that the disturbance for the $i$th unit was related to the disturbances
of contiguous units, then

$$u_1 = f(u_2, u_3)$$

$$u_2 = f(u_1, u_3, u_4, u_5)$$

and so on, and we would have some nonzero terms in the off-diagonal positions in
var(u).

Figure 8-1 A sample of “contiguous” units.

Figure 8-2 Correlogram for AR(1) with $\phi = 0.5$.

Estimation of var(u) as in Eq. (8-37) from any finite sample is impossible
since the number of unknowns exceeds the number of observations, nor is any
relief afforded by increasing the number of observations, as it brings a concomitant
increase in the number of unknowns. The practical procedure is to secure a
reduction in the number of unknown parameters by postulating some structure
for the disturbances. In time-series applications simplified structures are typically
autoregressive (AR) processes, moving average (MA) processes, or joint autoregressive,
moving average (ARMA) processes. We have already had an example of
an autoregressive process in Eq. (8-7), namely,


$$u_t = \phi u_{t-1} + \varepsilon_t \qquad |\phi| < 1$$

where the coefficient of the lagged term is denoted by $\phi$ as we now wish to use $\rho$
to denote an autocorrelation coefficient. This is a first-order, or AR(1), process, and
the result already established in Eq. (8-10) gives us the autocorrelation function
(ACF) of the process as

$$\rho_s = \phi^s \qquad s = 0, 1, 2, \ldots \tag{8-38}$$

Thus the autocorrelations decay exponentially and will oscillate in sign if $\phi$ is
negative. The graph of the autocorrelation function is called the correlogram, and
a typical correlogram for the AR(1) process, with positive $\phi$, is shown in Fig. 8-2.
The AR(2) process is defined as

$$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \varepsilon_t \tag{8-39}$$


In these and all other applications the $\{\varepsilon_t\}$ series is always assumed to be 
“well-behaved,” that is, $E(\varepsilon) = 0$ and $E(\varepsilon\varepsilon') = \sigma_\varepsilon^2 I$. The condition 
$|\phi| < 1$ ensured that the AR(1) process had a finite variance. The resultant 
$\{u_t\}$ series was an example of a stationary process.† In the second-order case 
the conditions for stationarity are‡ 

$$|\phi_2| < 1 \qquad \phi_2 + \phi_1 < 1 \qquad \phi_2 - \phi_1 < 1$$

To establish the autocorrelation coefficients of the AR(2) process, multiply Eq. 
(8-39) by $u_{t-s}$ and take expectations, giving 

$$\gamma_s = \phi_1 \gamma_{s-1} + \phi_2 \gamma_{s-2} \qquad s > 0$$

since $\varepsilon_t$ has zero covariance with all previous $u$’s. Dividing through by the 
variance $\gamma_0$ of the series gives 

$$\rho_s = \phi_1 \rho_{s-1} + \phi_2 \rho_{s-2} \qquad s > 0 \tag{8-40}$$

Setting $s = 1$ and using $\rho_0 = 1$ and $\rho_{-1} = \rho_1$, it follows that 

$$\rho_1 = \frac{\phi_1}{1 - \phi_2} \tag{8-41}$$

Setting $s = 2$ and using Eq. (8-41) then gives 

$$\rho_2 = \phi_2 + \frac{\phi_1^2}{1 - \phi_2} \tag{8-42}$$

† A stationary process has a constant and finite mean and variance and a set of covariances which 
are independent of time and are functions only of the lag length. 

‡ G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, revised edition, 
Holden-Day, San Francisco, 1976, p. 58. 


These first two autocorrelations, in conjunction with the recurrence relation 
(8-40), will yield the higher order autocorrelations. The stationarity conditions 
ensure that the autocorrelation function decays as the lag length increases. To 
obtain the variance of the $u$ series, square both sides of Eq. (8-39) and take 
expectations, giving 

$$\sigma_u^2 \left[ 1 - \phi_1^2 - \phi_2^2 - 2\phi_1\phi_2\rho_1 \right] = \sigma_\varepsilon^2$$

Substituting for $\rho_1$ from Eq. (8-41) and simplifying gives the result 

$$\sigma_u^2 = \frac{(1 - \phi_2)\,\sigma_\varepsilon^2}{(1 + \phi_2)\left[ (1 - \phi_2)^2 - \phi_1^2 \right]} \tag{8-43}$$
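The AR(2) results in Eqs. (8-40) to (8-43) can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names `ar2_acf` and `ar2_variance` and the parameter values 0.5 and 0.3 are illustrative assumptions, not part of the text.

```python
import numpy as np

def ar2_acf(phi1, phi2, max_lag=10):
    """Autocorrelations rho_0 .. rho_max_lag of a stationary AR(2) process."""
    # Stationarity conditions as quoted in the text.
    if not (abs(phi2) < 1 and phi2 + phi1 < 1 and phi2 - phi1 < 1):
        raise ValueError("parameters are outside the stationary region")
    rho = np.empty(max_lag + 1)
    rho[0] = 1.0
    if max_lag >= 1:
        rho[1] = phi1 / (1.0 - phi2)                      # Eq. (8-41)
    for s in range(2, max_lag + 1):
        rho[s] = phi1 * rho[s - 1] + phi2 * rho[s - 2]    # Eq. (8-40)
    return rho

def ar2_variance(phi1, phi2, sigma2_eps=1.0):
    """Variance of the u series from Eq. (8-43)."""
    return (1 - phi2) * sigma2_eps / ((1 + phi2) * ((1 - phi2) ** 2 - phi1 ** 2))

# Illustrative (assumed) parameter values.
rho = ar2_acf(0.5, 0.3, max_lag=5)
var_u = ar2_variance(0.5, 0.3)
```

The recursion needs only the two starting values $\rho_0$ and $\rho_1$, which is why the sketch fills those in before looping.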
The general AR process of order $p$ is defined as 

$$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \cdots + \phi_p u_{t-p} + \varepsilon_t \tag{8-44}$$

where the $\{\varepsilon_t\}$ series is well behaved, and suitable conditions are imposed on the 
$\phi$’s to ensure stationarity. 

The general MA process of order $q$ is defined as 

$$u_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q} \tag{8-45}$$

An example of a fourth-order process was given in the context of a wage change 
model in Eq. (8-3), and the corresponding variance matrix in Eq. (8-5) showed 
that the autocorrelation coefficients are zero for all lags greater than the order of 
the MA process. Consider the MA(1) process 

$$u_t = \varepsilon_t + \theta \varepsilon_{t-1} \tag{8-46}$$

It follows directly that 

$$\gamma_0 = \sigma_u^2 = E(u_t^2) = (1 + \theta^2)\sigma_\varepsilon^2$$

and 

$$\gamma_1 = E(u_t u_{t-1}) = \theta \sigma_\varepsilon^2$$

so that 

$$\rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{\theta}{1 + \theta^2}$$

and all further autocorrelation coefficients are zero. 
A finite-order MA process may be converted into an infinite-order AR 
process and vice versa. For instance, using the lag operator $L$, Eq. (8-46) may be 
rewritten as 

$$u_t = (1 + \theta L)\varepsilon_t$$

giving 

$$(1 + \theta L)^{-1} u_t = \varepsilon_t$$

or 

$$(1 - \theta L + \theta^2 L^2 - \cdots) u_t = \varepsilon_t$$

or 

$$u_t = \theta u_{t-1} - \theta^2 u_{t-2} + \theta^3 u_{t-3} - \cdots + \varepsilon_t$$

which is an infinite AR process with the restriction that the coefficients are given 
by the successive powers of $\theta$ with alternating signs. Similarly the AR(1) process 
may be written 

$$(1 - \phi L) u_t = \varepsilon_t$$

or 

$$u_t = (1 - \phi L)^{-1} \varepsilon_t = (1 + \phi L + \phi^2 L^2 + \cdots)\varepsilon_t$$

or 

$$u_t = \varepsilon_t + \phi \varepsilon_{t-1} + \phi^2 \varepsilon_{t-2} + \cdots$$

which is an infinite MA process with the coefficients given by the successive 
powers of $\phi$. The general ARMA($p$, $q$) process is defined by 


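$$u_t = \phi_1 u_{t-1} + \cdots + \phi_p u_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} \tag{8-47}$$

This may be written more compactly, using polynomials in the lag operator, as 

$$\phi(L) u_t = \theta(L) \varepsilon_t \tag{8-48}$$

where $\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p$ 
and $\theta(L) = 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q$ 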


The mixed ARMA formulation enables quite complicated processes to be repre- 
sented by a suitable choice of low-order polynomials. 
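The conversion between finite MA and infinite AR forms (and the reverse) amounts to inverting a polynomial in the lag operator. The sketch below, in Python with NumPy, computes the leading coefficients of such an inverse by matching powers of $L$; the helper name `invert_lag_polynomial` and the values of `theta` and `phi` are assumptions made for illustration.

```python
import numpy as np

def invert_lag_polynomial(poly, n_terms):
    """First n_terms coefficients of 1 / poly(L); poly[0] is the coefficient of L^0."""
    poly = np.asarray(poly, dtype=float)
    inv = np.zeros(n_terms)
    inv[0] = 1.0 / poly[0]
    for k in range(1, n_terms):
        # The coefficient of L^k in poly(L) * inv(L) must be zero for k >= 1.
        acc = sum(poly[j] * inv[k - j] for j in range(1, min(k, len(poly) - 1) + 1))
        inv[k] = -acc / poly[0]
    return inv

theta, phi = 0.6, 0.5                                  # illustrative values
ar_form = invert_lag_polynomial([1.0, theta], 5)       # (1 + theta L)^{-1}
ma_form = invert_lag_polynomial([1.0, -phi], 5)        # (1 - phi L)^{-1}
```

Here `ar_form` holds the coefficients applied to $u_t, u_{t-1}, \ldots$ in the infinite AR form of the MA(1) process, and `ma_form` holds the weights on $\varepsilon_t, \varepsilon_{t-1}, \ldots$ in the infinite MA form of the AR(1) process.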

Returning to the spatial example in Fig. 8-1, a common hypothesis about the 
structure of the autocorrelation is† 

$$u_i = \rho \sum_j w_{ij} u_j + \varepsilon_i \tag{8-49}$$

where 

$$w_{ij} = \frac{w_{ij}^*}{\sum_j w_{ij}^*}$$

and 

$$w_{ij}^* = \begin{cases} 1 & \text{if units } i, j \text{ are contiguous} \\ 0 & \text{otherwise} \end{cases}$$

In this formulation $\rho$ is a scalar indicating the overall strength of the 
autocorrelation, and the weights $w_{ij}$ are essentially dummy variables which allow any 
disturbance to be affected by contiguous disturbances. The matrix formulation of 
Eq. (8-49) is 

$$u = \rho W u + \varepsilon \tag{8-50}$$

and for Fig. 8-1 the W matrix would be 

$$W = \begin{bmatrix} 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 0 \\ \tfrac{1}{4} & 0 & \tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{4} & 0 \\ \vdots & & & & & \vdots \end{bmatrix}$$

[rows 3 to 6 of W are not legible in the source] 

From Eq. (8-50) the variance matrix of the disturbance term is 

$$\text{var}(u) = \sigma_\varepsilon^2 (I - \rho W)^{-1} (I - \rho W')^{-1} \tag{8-51}$$

In Eq. (8-51) W is generally a known matrix, but $\rho$ and $\sigma_\varepsilon^2$ are unknown scalars. 
However, we will not consider the resultant estimation problems here.‡ 

† See, for example, R. L. Martin, “On Spatial Dependence, Bias and the Use of First Spatial 
Differences in Regression Analysis,” Area, vol. 6, 1974, pp. 185-194. 
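To make the construction of W and Eq. (8-51) concrete, here is a small sketch in Python with NumPy. The four-unit contiguity pattern is hypothetical (it is not the layout of Fig. 8-1), and the names `contiguous` and `spatial_variance` and the value of `rho` are assumptions.

```python
import numpy as np

# Hypothetical contiguity pattern for four units (NOT the layout of Fig. 8-1):
# unit 0 touches 1, 2; unit 1 touches 0, 2, 3; unit 2 touches 0, 1, 3; unit 3 touches 1, 2.
contiguous = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# Row-standardize so that each row of W sums to one, as in the definition of w_ij.
W = contiguous / contiguous.sum(axis=1, keepdims=True)

def spatial_variance(W, rho, sigma2_eps=1.0):
    """var(u) = sigma2_eps * (I - rho W)^{-1} (I - rho W')^{-1}, as in Eq. (8-51)."""
    A_inv = np.linalg.inv(np.eye(W.shape[0]) - rho * W)
    return sigma2_eps * A_inv @ A_inv.T

V = spatial_variance(W, rho=0.4)   # rho = 0.4 is an illustrative value
```

Even units that are not contiguous (units 0 and 3 in this pattern) end up with a nonzero covariance, because dependence propagates through shared neighbors.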


Reasons for Autocorrelated Disturbances 


Some possible reasons have already been mentioned in Sec. 8-1. A general source 
of autocorrelated disturbances is the fact that the disturbance represents the net 
influence of omitted explanatory variables. Economic theory cannot prescribe an 
exhaustive list of explanatory variables to be included in a relation and, in any 
case, data limitations often curtail the number of variables that can be included. 
Exclusion of variables would not of itself impart autocorrelation to the dis- 
turbance term unless the excluded variables were autocorrelated. Even then 
autocorrelation in one explanatory variable might offset that in another. However, 
economic variables tend to be nonrandom over time and also to move roughly in 
phase so that excluded variables may impart autocorrelation to the disturbance 
term. A second source of autocorrelation may be a misspecification of the form of 
the relationship. Suppose the true relationship is represented by line A in Fig. 8-3, 
and the linear function B is fitted to the data. The sample points will be scattered 
around A and so the residuals from B will tend to be positive for $X < X_1$, 
negative for $X_1 < X < X_2$, and positive again for $X > X_2$. Such a case might be 
spotted simply from an inspection of the scatter diagram, and a transformation to 
a quadratic or other nonlinear function would yield random disturbances. Even in 
the case shown in Fig. 8-3 the X variable would need to be reordered 
in increasing size for the correlation between adjacent residuals to show up. In 
multiple regression situations misspecifications of the form of the equation cannot 
usually be detected by graphical means, and some experimentation with different 
functional forms may be required to judge whether any autocorrelation that 
shows up may be a reflection of specification error. 


‡ For a discussion of estimation procedures see A. J. Cliff and J. K. Ord, Spatial Processes, Models 
and Applications, Chapman and Hall, London, 1981. 

A third possible source of autocorrelated disturbances may be measurement 
error in the dependent variable. Economic statisticians typically have various 
formalized routines and procedures for estimating (or some say guesstimating) 
economic magnitudes. The sequential publication of revised estimates is eloquent 
testimony to the fact that the creators of the series believed their products to 
contain some error, and indeed a series becomes definitive simply when the 
statisticians stop revising it, which is not to say that it is then free of error. It is 
unlikely that the estimating procedures produce errors which are random from 
period to period and so, letting the $y$ vector denote the observed Y values and $y^*$ 
the true Y values generated by the mechanism $X\beta + u$, we have 

$$y = y^* + v = X\beta + (u + v)$$

where $v$ is a vector of measurement errors. In the observed relationship the 
disturbance term is $u + v$, which may exhibit autocorrelation through $u$ or $v$ or 
both. 


Consequences of Autocorrelation for OLS 


The consequences of applying OLS to a relationship with autocorrelated dis- 
turbances are qualitatively similar to those already derived for the heteroscedastic 
case, namely, unbiased but inefficient estimation and invalid inference proce- 
dures. It is possible to illustrate some of these factors quantitatively for certain 
simple cases. Consider the model 
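
$$y_t = \beta x_t + u_t \tag{8-52}$$

with 

$$u_t = \rho u_{t-1} + \varepsilon_t$$

where 

$$E(\varepsilon) = 0 \quad \text{and} \quad E(\varepsilon\varepsilon') = \sigma_\varepsilon^2 I$$

If OLS is applied to Eq. (8-52), we know from Eqs. (8-11) and (8-13) that 

$$\text{var}(\text{OLS } b) = \sigma_u^2 (X'X)^{-1} X' \Omega X (X'X)^{-1} \tag{8-53}$$

where 

$$\Omega = \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{bmatrix} \tag{8-54}$$

and, as may be verified by multiplying out, 

$$\Omega^{-1} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1 & -\rho & 0 & \cdots & 0 & 0 \\ -\rho & 1 + \rho^2 & -\rho & \cdots & 0 & 0 \\ 0 & -\rho & 1 + \rho^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 + \rho^2 & -\rho \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{bmatrix} \tag{8-55}$$
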
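Substituting Eq. (8-54) in Eq. (8-53) for the model of Eq. (8-52) gives 

$$\text{var}(\text{OLS } b) = \frac{\sigma_u^2}{\sum_{t=1}^{n} x_t^2} \left( 1 + 2\rho \frac{\sum_{t=1}^{n-1} x_t x_{t+1}}{\sum_{t=1}^{n} x_t^2} + 2\rho^2 \frac{\sum_{t=1}^{n-2} x_t x_{t+2}}{\sum_{t=1}^{n} x_t^2} + \cdots + 2\rho^{n-1} \frac{x_1 x_n}{\sum_{t=1}^{n} x_t^2} \right) \tag{8-56}$$

If $\rho$ were known and GLS was applied to Eq. (8-52), then substitution of Eq. 
(8-55) in the general formula 

$$\text{var}(\text{GLS } b) = \sigma_u^2 (X' \Omega^{-1} X)^{-1}$$

gives 

$$\text{var}(\text{GLS } b) \approx \frac{\sigma_u^2}{\sum_{t=1}^{n} x_t^2} \left( \frac{1 - \rho^2}{1 + \rho^2 - 2\rho \sum_{t=2}^{n} x_t x_{t-1} / \sum_{t=1}^{n} x_t^2} \right) \tag{8-57}$$

Comparison of Eq. (8-57) with Eq. (8-56) shows that the efficiency of OLS is 
measured by the ratio of the two terms in parentheses and thus depends not just 
on $\rho$ but also on the nature of the $x$ variable. Let us suppose that $x$ follows a 
stable AR(1) scheme with parameter $\lambda$. As the sample size $n$ gets very large, the 
terms involving $x$ are approximated by the autocorrelation coefficients of $x$, which 
are simply the successive powers of $\lambda$. Thus the asymptotic efficiency of OLS is 
given by 

$$\lim_{n \to \infty} \frac{\text{var}(\text{GLS } b)}{\text{var}(\text{OLS } b)} = \frac{1 - \rho^2}{(1 + \rho^2 - 2\rho\lambda)(1 + 2\rho\lambda + 2\rho^2\lambda^2 + \cdots)} = \frac{(1 - \rho^2)(1 - \rho\lambda)}{(1 + \rho^2 - 2\rho\lambda)(1 + \rho\lambda)} \tag{8-58}$$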


Table 8-2 shows illustrative values of this asymptotic efficiency for selected values 
of $\rho$ and $\lambda$. Looking first at the right-hand side of the table, where $\rho$ and $\lambda$ are 
both positive, it is clear that $\rho$ is the dominant parameter. Efficiency declines from 
90 percent to about 10 percent as $\rho$ rises from 0.2 to 0.9, with variations in $\lambda$ 
having a relatively minor effect. The diagonal entries are equal to those in the first 
row since if $\lambda = 0$ or if $\rho = \lambda$, the efficiency measure simplifies to 
$(1 - \rho^2)/(1 + \rho^2)$. Looking at the left-hand side of the table, the efficiencies are 
symmetrical across the first row where the $\{x_t\}$ series is random. The remaining rows show 
that $\lambda$ now exerts a much stronger effect and that the combination of a positive $\lambda$ 
and negative $\rho$ can moderate the dramatic declines in efficiency shown in the 
right-hand side of the table. These calculations are, of course, only illustrative, but 
they indicate the possibility of a serious loss in efficiency if OLS is applied in the 
context of autocorrelated disturbances. 

Table 8-2 Asymptotic efficiency of OLS b (percent) 

                          values of $\rho$ 
 $\lambda$     -0.9   -0.8   -0.5   -0.2    0.2    0.5    0.8    0.9 
 0.0   10.5   22.0   60.0   92.3   92.3   60.0   22.0   10.5 
 0.2   12.6   25.4   63.2   92.9   92.3   58.4   19.8    9.1 
 0.5   18.5   34.4   71.4   94.6   93.5   60.0   18.4    7.9 
 0.8   35.9   56.2   85.4   97.5   96.6   71.4   22.0    8.4 
 0.9   52.8   71.8   92.0   98.7   98.1   81.3   29.3   10.5 
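The entries of Table 8-2 follow directly from Eq. (8-58). As a minimal sketch in Python, the function below evaluates that closed form; the function name and the grid variables are illustrative assumptions.

```python
def ols_asymptotic_efficiency(rho, lam):
    """Asymptotic efficiency of OLS relative to GLS from Eq. (8-58), as a fraction."""
    return (1 - rho**2) * (1 - rho * lam) / ((1 + rho**2 - 2 * rho * lam) * (1 + rho * lam))

rhos = [-0.9, -0.8, -0.5, -0.2, 0.2, 0.5, 0.8, 0.9]
lams = [0.0, 0.2, 0.5, 0.8, 0.9]

# One row of percentages per value of lambda, in the column order of Table 8-2.
table = {lam: [100 * ols_asymptotic_efficiency(r, lam) for r in rhos] for lam in lams}
```

Evaluating the formula on a finer grid is a quick way to see how strongly the efficiency loss depends on the sign combination of $\rho$ and $\lambda$.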

A second problem with the application of OLS is that the conventional 
formula on the computer for var(b) will, in this example, estimate $\sigma_u^2 / \sum x_t^2$, 
whereas Eq. (8-56) shows that this is no longer the true variance. As the sample 
size gets very large, the ratio of the conventional formula to the true variance is 
given by 

$$\frac{1 - \rho\lambda}{1 + \rho\lambda}$$

Thus the proportionate bias that the conventional program will impart to the 
estimation of the true sampling variance of the OLS $b$ is, in the limit, 

$$\text{Asymptotic proportionate bias} = \frac{-2\rho\lambda}{1 + \rho\lambda} \tag{8-59}$$


Table 8-3 shows values of this statistic for selected values of $\rho$ and $\lambda$. Again it is 
instructive to consider the table in two halves. A positively autocorrelated 
disturbance in conjunction with a positively autocorrelated $\{x_t\}$ series implies 
underestimation of the sampling variance by the conventional OLS formula. If 
$\rho = \lambda = 0.9$, the estimated variance will only be about one-tenth of the correct 
number, which would cause a serious overestimation of $t$ statistics and significance 
levels in conventional inference procedures. On the other hand, different signs for 
$\rho$ and $\lambda$ will cause an overestimation of the sampling variance. Casual empiricism 
indicates that the predominant situation in applied studies is a conjunction of 
positive autocorrelation in both disturbance and explanatory variable so that 
underestimation of var(b) is the more likely situation. 

Table 8-3 Percentage bias in estimating var(b) 

                     values of $\rho$ 
 $\lambda$     -0.9   -0.5   -0.2    0.2    0.5    0.9 
 0.0     0      0      0      0      0      0 
 0.2    43.9   22.2    8.3   -7.7  -18.2  -30.5 
 0.5   163.6   66.7   22.2  -18.2  -40.0  -62.1 
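The bias figures can be reproduced from Eq. (8-59). The following Python sketch evaluates the formula for one same-sign and one opposite-sign case; the function and variable names are assumptions chosen for illustration.

```python
def conventional_variance_bias(rho, lam):
    """Asymptotic proportionate bias of the conventional OLS variance formula, Eq. (8-59)."""
    return -2 * rho * lam / (1 + rho * lam)

bias_same_sign = conventional_variance_bias(0.9, 0.9)     # rho and lambda both positive
bias_opposite = conventional_variance_bias(-0.9, 0.5)     # opposite signs
ratio_same_sign = 1 + bias_same_sign                      # conventional / true variance
```

With $\rho = \lambda = 0.9$ the ratio is $(1 - 0.81)/(1 + 0.81)$, which is the "about one-tenth" quoted in the text.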

The comparison involved in Table 8-3 has implicitly taken $\sigma_u^2$ as known. In 
practice it must be estimated from the sample data, and here again there is a 
possibility of bias if the disturbance term is autocorrelated. We saw in Chap. 5 
that 

$$e'e = u'u - u'X(X'X)^{-1}X'u$$

Thus 

$$E(e'e) = E(u'u) - E\{u'X(X'X)^{-1}X'u\}$$

Now 

$$E(u'u) = n\sigma_u^2$$

and, since the expression in brackets is a scalar, 

$$\begin{aligned} E\{u'X(X'X)^{-1}X'u\} &= E\{\text{tr}[u'X(X'X)^{-1}X'u]\} \\ &= E\{\text{tr}[X(X'X)^{-1}X'uu']\} \\ &= \sigma_u^2\,\text{tr}\{X(X'X)^{-1}X'\Omega\} \\ &= \sigma_u^2\,\text{tr}\{(X'X)^{-1}X'\Omega X\} \end{aligned} \tag{8-60}$$

For the simple model being analyzed here 

$$E(e'e) = \sigma_u^2 \left\{ n - \left[ 1 + 2\rho \frac{\sum x_t x_{t+1}}{\sum x_t^2} + 2\rho^2 \frac{\sum x_t x_{t+2}}{\sum x_t^2} + \cdots + 2\rho^{n-1} \frac{x_1 x_n}{\sum x_t^2} \right] \right\}$$

If $\rho = 0$, then $E(e'e) = (n - 1)\sigma_u^2$, which confirms the unbiased estimator 
$s^2 = e'e/(n - 1)$ of Chap. 5. If we assume the $\{x_t\}$ series to be a stable AR(1) process 
with parameter $\lambda$, then for large $n$ 

$$E(e'e) = \sigma_u^2 \left( n - \frac{1 + \rho\lambda}{1 - \rho\lambda} \right)$$

If $\rho$ and $\lambda$ have the same sign, then $s^2$ will have a downward bias as an estimator 
of $\sigma_u^2$. If, for instance, $\rho = 0.9 = \lambda$ and $n = 101$, 

$$E(s^2) = 0.915\sigma_u^2$$

Thus when $\rho$ and $\lambda$ have the same sign, this bias accentuates the bias analyzed in 
Table 8-3. It is clear that autocorrelated disturbances are a potentially serious 
problem, and it is very important to be able to test for their existence. 
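The numerical example for the bias in $s^2$ can be verified with a few lines of Python. This is a sketch for the one-regressor model only, where $s^2 = e'e/(n - 1)$; the function name is an assumption.

```python
def expected_s2_ratio(rho, lam, n):
    """Large-sample E(s^2) / sigma_u^2 for the one-regressor model, with s^2 = e'e / (n - 1)."""
    return (n - (1 + rho * lam) / (1 - rho * lam)) / (n - 1)

ratio = expected_s2_ratio(0.9, 0.9, 101)   # the case worked in the text
```

The value comes out at roughly 0.915, matching the figure quoted above, and the ratio returns to exactly one when $\rho = 0$.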


Tests for Autocorrelation 


Suppose that in the model $y = X\beta + u$ one suspects that the disturbance term 
follows an AR(1) scheme, that is, 

$$u_t = \phi u_{t-1} + \varepsilon_t$$

The null hypothesis of zero autocorrelation would then be set up as 

$$H_0: \phi = 0$$

and the alternative hypothesis as 

$$H_1: \phi \neq 0$$

The null hypothesis is about the $u$’s, which are unobservable. One therefore looks 
for a test of the null hypothesis using the vector of OLS residuals, $e = y - Xb$. 
This raises several difficulties. We know from Chap. 5 that 

$$e = Mu$$

where 

$$M = I - X(X'X)^{-1}X'$$

is symmetric, idempotent of rank $n - k$. Thus, when $E(uu') = \sigma_u^2 I$, the 
variance-covariance matrix of the $e$’s is 

$$E(ee') = \sigma_u^2 M$$

Even if the null hypothesis is true, therefore, the OLS residuals 
will display some autocorrelation, for the off-diagonal terms in M do not vanish. 
More importantly M is a function of the sample values of the explanatory 
variables, so that it is impossible to derive an exact finite sample test on the e’s 
which will be valid for any X matrix that might ever turn up. 
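The properties of M claimed above are easy to confirm numerically. The sketch below uses Python with NumPy and a hypothetical design matrix (a constant plus one randomly generated regressor); the dimensions and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2

# Hypothetical design: a constant and one arbitrary regressor.
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Residual-maker matrix M = I - X (X'X)^{-1} X'.
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

# Off-diagonal part of M: nonzero entries mean correlated OLS residuals
# even when the disturbances themselves are uncorrelated.
off_diagonal = M - np.diag(np.diag(M))
```

Changing X changes `off_diagonal`, which is the dependence on the sample values of the regressors that rules out a single exact test for all X matrices.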


Durbin-Watson test. These problems were treated in a pair of classic and 
pathbreaking articles.† The Durbin-Watson test statistic is computed from the vector 
of OLS residuals $e = y - Xb$. It is denoted in the literature variously as $d$ or DW 
and is defined as 

$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \tag{8-61}$$

Figure 8-4 indicates why $d$ might be expected to measure the extent of first-order 
autocorrelation. The mean residual is zero, so the residuals will be scattered 
around the horizontal axis. If the $e$’s are positively autocorrelated, successive 
values will tend to be close to each other, runs above and below the horizontal 
axis will occur, and the first differences will tend to be numerically smaller than 
the residuals themselves. Alternatively if the $e$’s have a first-order negative 
correlation, there is a tendency for successive observations to be on opposite sides 
of the horizontal axis, so that first differences tend to be numerically larger than 
the residuals. Thus $d$ will tend to be “small” for positively autocorrelated $e$’s and 
“large” for negatively autocorrelated $e$’s. If the $e$’s are random, we have an 
in-between situation with no tendency for runs above and below the axis or for 
alternate swings across it, and $d$ will take on an intermediate value. 

[Figure 8-4 (a) Positive autocorrelation; (b) negative autocorrelation.] 

† J. Durbin and G. S. Watson, “Testing for Serial Correlation in Least Squares Regression,” 
Biometrika, vol. 37, 1950, pp. 409-428; vol. 38, 1951, pp. 159-178. 


The Durbin-Watson statistic is closely related to the sample first-order 
autocorrelation coefficient of the $e$’s. Expanding Eq. (8-61), 

$$d = \frac{\sum_{t=2}^{n} e_t^2 + \sum_{t=2}^{n} e_{t-1}^2 - 2\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}$$

For large $n$ the different ranges of summation in numerator and denominator 
have a diminishing effect and 

$$d \approx 2(1 - r) \tag{8-62}$$

where $r = \sum e_t e_{t-1} / \sum e_{t-1}^2$ is the coefficient in the OLS regression of $e_t$ on $e_{t-1}$. 
Formula (8-62) shows heuristically that the range of $d$ is from 0 to 4: 

$d < 2$ for positive autocorrelation of the $e$’s 
$d > 2$ for negative autocorrelation of the $e$’s 
$d \approx 2$ for zero autocorrelation of the $e$’s 

The hypothesis under test is, of course, about the properties of the unobservable 
$u$’s, which will not be reproduced exactly by the OLS residuals, but the 
above indicators are nonetheless valid in that $d$ will tend to be less (greater) than 
2 for positive (negative) autocorrelation of the $u$’s. For a random $u$ series the 
expected value of $d$ is 

$$E(d) = 2 + \frac{2(k - 1)}{n - k} \tag{8-63}$$

where $k$ is the number of variables in the regression. 
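The statistic in Eq. (8-61) and the approximation in Eq. (8-62) can be illustrated as follows. This is a sketch in Python with NumPy; the simulated AR(1) disturbances, the trend regressor, the seed, and the function names are all assumptions made for the example.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson d of Eq. (8-61) for a residual vector e."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def first_order_r(e):
    """Slope in the no-intercept regression of e_t on e_{t-1}."""
    e = np.asarray(e, dtype=float)
    return np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)

# Hypothetical data: AR(1) disturbances with phi = 0.7, then OLS on a constant and trend.
rng = np.random.default_rng(1)
n, phi = 500, 0.7
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = phi * u[t - 1] + eps[t]
X = np.column_stack([np.ones(n), np.arange(n)])
e = u - X @ np.linalg.lstsq(X, u, rcond=None)[0]   # OLS residuals

d = durbin_watson(e)
r = first_order_r(e)
```

With positively autocorrelated disturbances `d` falls well below 2, and `2 * (1 - r)` is close to `d` at this sample size.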

Because of the dependence of any computed $d$ value on the associated X 
matrix, exact critical values of $d$ cannot be tabulated for all possible cases. Durbin 
and Watson established upper ($d_U$) and lower ($d_L$) bounds for the critical values. 
The tabulated bounds are to test the hypothesis of zero autocorrelation against 
the alternative of positive first-order autocorrelation. The testing procedure is as 
follows. 

1. If $d < d_L$, reject the hypothesis of nonautocorrelated $u$ in favor of the 
hypothesis of positive first-order autocorrelation. 
2. If $d > d_U$, do not reject the null hypothesis. 
3. If $d_L < d < d_U$, the test is inconclusive. 


If the sample value of $d$ exceeds 2, we wish to test the null hypothesis against 
the alternative hypothesis of negative first-order autocorrelation. The appropriate 
procedure is to compute $4 - d$ and compare this statistic with the tabulated 
values of $d_L$ and $d_U$ as if one were testing for positive autocorrelation. The 
original DW tables covered sample sizes from 15 to 100, with 5 as the maximum 
number of regressors. Savin and White have published extended tables for 
$6 \le n \le 200$ and up to 10 regressors.† The 5 percent and 1 percent Savin-White 
tables are reproduced in App. B-5. 
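The bounds procedure, including the use of $4 - d$ when $d$ exceeds 2, can be summarized in a short Python sketch. The function name is an assumption, and the bounds passed in the usage lines are placeholder numbers for illustration only; real values of $d_L$ and $d_U$ must be read from the published tables.

```python
def durbin_watson_decision(d, d_lower, d_upper):
    """Bounds test; d_lower and d_upper must be taken from published tables."""
    if not 0.0 <= d <= 4.0:
        raise ValueError("d must lie between 0 and 4")
    # For d above 2, test against negative autocorrelation using 4 - d.
    alternative = "positive" if d <= 2.0 else "negative"
    stat = d if d <= 2.0 else 4.0 - d
    if stat < d_lower:
        return f"reject H0 in favor of {alternative} first-order autocorrelation"
    if stat > d_upper:
        return "do not reject H0"
    return "inconclusive"

# Placeholder bounds (NOT table values), used only to exercise the three branches.
low_d = durbin_watson_decision(0.8, d_lower=1.2, d_upper=1.4)
high_d = durbin_watson_decision(3.3, d_lower=1.2, d_upper=1.4)
middle_d = durbin_watson_decision(1.3, d_lower=1.2, d_upper=1.4)
```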

There are two important qualifications to the use of the Durbin-Watson test. 
First it is necessary to have included a constant term in the regression. Second, it 
is strictly valid only for a nonstochastic X. Thus it is not applicable when a lagged 
dependent variable appears among the regressors, and indeed it can be shown 
that the combination of a lagged Y variable and a positively autocorrelated 
disturbance term will bias the Durbin-Watson statistic upward and thus give 
misleading indications.‡ Even when the conditions for the validity of the 
Durbin-Watson test are satisfied, the inconclusive range is an awkward problem, 
especially as it becomes fairly large at low degrees of freedom. A conservative 
practical procedure is to use $d_U$ as if it were a conventional critical value and 
simply reject the null hypothesis if $d < d_U$. There are two reasons for this 
procedure. First, the consequences of accepting $H_0$ when autocorrelation is present 
are almost certainly more serious than the consequences of incorrectly assuming it 
to be present.§ Second, it has been shown that when the regressors are slowly 
changing series, as many economic series are, the true critical value will be close 
to the Durbin-Watson upper bound.¶ 

When the regression does not contain an intercept term, the critical value of $d$ 
is bounded by 

$$d_M \le d \le d_U$$

where $d_U$ is the upper bound of the conventional Durbin-Watson tables. 
Farebrother has provided extensive tabulations of both lower and upper 1 percent 
and 5 percent significance points for $d_M$.‖ 

† N. E. Savin and K. J. White, “The Durbin-Watson Test for Serial Correlation with Extreme 
Sample Sizes or Many Regressors,” Econometrica, vol. 45, 1977, pp. 1989-1996. 

‡ M. Nerlove and K. F. Wallis, “Use of the Durbin-Watson Statistic in Inappropriate Situations,” 
Econometrica, vol. 34, 1966, pp. 235-238. 

§ A comprehensive Monte Carlo study relevant to this question is J. K. Peck, “The Estimation of a 
Dynamic Equation Following a Preliminary Test for Autocorrelation,” Cowles Foundation Discussion 
Paper, no. 404, September 9, 1975. After studying the properties of regression estimators following 
different significance levels for $d$, the author recommends using a significance level much more likely 
(than the conventional levels) to reject $H_0$ when it is true. This is in the same spirit as using $d_U$ as the 
critical value. 

¶ H. Theil and A. L. Nagar, “Testing the Independence of Regression Disturbances,” Journal of 
the American Statistical Association, vol. 56, 1961, pp. 793-806; and E. J. Hannan and R. D. Terrell, 
“Testing for Serial Correlation after Least Squares Regression,” Econometrica, vol. 36, 1968, pp. 
133-150. 

‖ R. W. Farebrother, “The Durbin-Watson Test for Serial Correlation when There Is No Intercept 
in the Regression,” Econometrica, vol. 48, 1980, pp. 1553-1563. 

The inconclusive range of the Durbin-Watson statistic can be narrowed if 
explicit account can be taken of the form of any regressors in addition to the 
constant term, since that reduces uncertainty about the X matrix. King has 
presented tabulations of $d_L$ and $d_U$ for three classes of linear regression models, 
namely: 

1. Regressions with a full set of quarterly seasonal dummy variables 
2. Regressions with an intercept and a linear trend variable 
3. Regressions with a full set of quarterly seasonal dummies and a linear trend 
variable† 

The Wallis test for fourth-order autocorrelation.‡ Wallis has pointed out that 
many applied studies employ quarterly data, and in such cases one might expect 
to find fourth-order autocorrelation in the disturbance term. The appropriate 
specification is then 

$$u_t = \phi_4 u_{t-4} + \varepsilon_t \tag{8-64}$$

and the null hypothesis would be 

$$H_0: \phi_4 = 0$$

To test the null hypothesis, Wallis proposes a modified Durbin-Watson statistic 

$$d_4 = \frac{\sum_{t=5}^{n} (e_t - e_{t-4})^2}{\sum_{t=1}^{n} e_t^2} \tag{8-65}$$

where the $e$’s are the usual OLS residuals. Wallis derives upper and lower bounds 
for $d_4$ under the assumption of a nonstochastic X matrix. The 5 percent 
significance points are tabulated in App. B-6. The first table is for use with 
regressions with an intercept, but without quarterly dummy variables. The second 
table is for use with regressions incorporating quarterly dummies. As shown in 
Chap. 6, one may employ a constant term and three quarterly dummies or use 
four quarterly dummies without a constant term. 

Further significance points at 0.5, 1.0, and 2.5 percent levels are provided by 
Giles and King.§ The same authors also point out that if one is testing $H_0$ against 
the alternative hypothesis $H_1: \phi_4 < 0$, the test statistic $4 - d_4$ may be correctly 
referred to the critical values $4 - d_{4,L}$ and $4 - d_{4,U}$, where $d_{4,L}$ and $d_{4,U}$ are the 
5 percent values tabulated by Wallis, only in the case where seasonal dummies 
have been included among the regressors. For the case where an intercept but no 
seasonal dummies have been employed, these critical values are inappropriate and 
the authors provide a revised set.¶ 
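The Wallis statistic of Eq. (8-65) differs from the Durbin-Watson $d$ only in using fourth differences. A minimal sketch in Python with NumPy follows; the function name and the two artificial residual patterns are assumptions chosen to show the extreme cases.

```python
import numpy as np

def wallis_d4(e):
    """Wallis statistic of Eq. (8-65): squared fourth differences over the sum of squares."""
    e = np.asarray(e, dtype=float)
    return np.sum((e[4:] - e[:-4]) ** 2) / np.sum(e ** 2)

# Hypothetical quarterly residuals that repeat exactly every four periods
# (extreme positive fourth-order autocorrelation).
seasonal = np.tile([1.0, -0.5, 0.3, -0.8], 10)

# Hypothetical residuals that flip sign every four periods
# (extreme negative fourth-order autocorrelation).
alternating = np.tile([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0], 5)

d4_seasonal = wallis_d4(seasonal)
d4_alternating = wallis_d4(alternating)
```

As with $d$, values of $d_4$ near 0 point to positive and values near 4 to negative fourth-order autocorrelation; the actual decision requires the tabulated bounds.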


† M. L. King, “The Durbin-Watson Test for Serial Correlation: Bounds for Regressions with 
Trend and/or Seasonal Dummy Variables,” Econometrica, vol. 49, 1981, pp. 1571-1581. 

‡ K. F. Wallis, “Testing for Fourth Order Autocorrelation in Quarterly Regression Equations,” 
Econometrica, vol. 40, 1972, pp. 617-636. 

§ D. E. A. Giles and M. L. King, “Fourth-Order Autocorrelation: Further Significance Points for 
the Wallis Test,” Journal of Econometrics, vol. 8, 1978, pp. 255-259. 

¶ M. L. King and D. E. A. Giles, “A Note on Wallis’ Bounds Test and Negative Autocorrelation,” 
Econometrica, vol. 45, 1977, pp. 1023-1026. 

Durbin tests for a regression containing lagged values of the dependent variable. 
As has been pointed out, the Durbin-Watson test procedure was derived under 
the assumption of a nonstochastic X matrix, which is violated by the presence of 
lagged values of the dependent variable appearing among the explanatory 
variables. Durbin has derived a large sample (asymptotic) test for the more general 
case.† Consider the relation 

$$Y_t = \beta_1 Y_{t-1} + \cdots + \beta_r Y_{t-r} + \beta_{r+1} X_{1t} + \cdots + \beta_{r+s} X_{st} + u_t \tag{8-66}$$

with 

$$u_t = \phi u_{t-1} + \varepsilon_t \quad \text{and} \quad \varepsilon \sim N(0, \sigma_\varepsilon^2 I)$$

The basic result is that under the null hypothesis, $H_0: \phi = 0$, the statistic 

$$h = r \sqrt{\frac{n}{1 - n \cdot \text{var}(b_1)}} \tag{8-67}$$

is asymptotically distributed as $N(0, 1)$, where 

$n$ = sample size 
var($b_1$) = estimated sampling variance of the coefficient of $Y_{t-1}$ in the OLS 
regression of Eq. (8-66) 
$r = \sum_{t=2}^{n} e_t e_{t-1} / \sum_{t=2}^{n} e_{t-1}^2$, the estimate of $\phi$ from the OLS regression of $e_t$ 
on $e_{t-1}$, the $e$’s in turn being the residuals from the OLS regression of 
Eq. (8-66) 


Thus the test procedure is as follows. 

1. Fit the OLS regression denoted by Eq. (8-66) and note var($b_1$). 
2. From the residuals compute $r$ or, alternatively, if the Durbin-Watson statistic 
has been computed, we may use the approximation $r \approx 1 - d/2$. 
3. Substitute in the formula for $h$, and if $h > 1.645$, reject the null hypothesis at 
the 5 percent level of significance in favor of the hypothesis of a positive 
first-order autocorrelation. 
4. A similar one-sided test for negative autocorrelation can be carried out for 
negative $h$. 

The test breaks down if it should happen that $n \cdot \text{var}(b_1) \ge 1$. Durbin showed 
that an asymptotically equivalent procedure is the following. 

1. Estimate the OLS regression of Eq. (8-66) and obtain the residual $e$’s. 
2. Estimate the OLS regression of $e_t$ on $e_{t-1}, Y_{t-1}, \ldots, Y_{t-r}, X_{1t}, \ldots, X_{st}$. 
3. If the coefficient of $e_{t-1}$ in this regression is significantly different from zero 
by the usual OLS test, reject the null hypothesis $H_0: \phi = 0$. 
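The first procedure above reduces to a one-line formula once $r$, $n$, and var($b_1$) are known. The sketch below is in Python; the function name and the input values ($d = 1.6$, $n = 50$, var($b_1$) = 0.01) are hypothetical numbers chosen for illustration.

```python
import math

def durbin_h(r, n, var_b1):
    """Durbin's h of Eq. (8-67); returns None when the statistic is undefined."""
    denom = 1.0 - n * var_b1
    if denom <= 0.0:
        # Breakdown case noted in the text: fall back on the auxiliary regression.
        return None
    return r * math.sqrt(n / denom)

# Hypothetical inputs: d = 1.6 from a regression with n = 50 and var(b1) = 0.01.
r_approx = 1 - 1.6 / 2                      # approximation r = 1 - d/2
h = durbin_h(r_approx, 50, 0.01)
reject_at_5_percent = h is not None and h > 1.645
```

Returning `None` rather than raising an error is a design choice here, so that the caller can switch to the second, asymptotically equivalent procedure.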


† J. Durbin, “Testing for Serial Correlation in Least Squares Regression when Some of the 
Regressors are Lagged Dependent Variables,” Econometrica, vol. 38, 1970, pp. 410-421. 

Breusch-Godfrey test. The procedures considered so far test the significance of a 
single autocorrelation coefficient. One might expect these tests to have reasonable 
power in the presence of more general forms of autocorrelation. For instance, if 

$$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \cdots + \phi_p u_{t-p} + \varepsilon_t$$

the $d$ test might well show $\phi_1$ to be significantly different from zero. But $\phi_1$ 
represents just a part of the autocorrelation now present, and one might find an 
insignificant value for the first-order statistic. This, however, sheds no light on the 
significance of $\phi_2, \ldots, \phi_p$. Thus a more general test is clearly desirable. Such a test 
has been developed, apparently independently, by Breusch and by Godfrey.† 
They postulate the usual model $y = X\beta + u$, where the X matrix may include 
lagged values of the dependent variable. The null hypothesis is 

$$H_0: u \sim N(0, \sigma_u^2 I)$$

Two alternative hypotheses are considered. One is that the $\{u_t\}$ are generated by 
an AR($p$) process, 

$$u_t = \phi_1 u_{t-1} + \cdots + \phi_p u_{t-p} + \varepsilon_t \tag{8-68}$$

The other hypothesis is that the $\{u_t\}$ are generated by an MA($p$) process, 

$$u_t = \varepsilon_t + \alpha_1 \varepsilon_{t-1} + \cdots + \alpha_p \varepsilon_{t-p} \tag{8-69}$$
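where in each case $\{\varepsilon_t\}$ is well-behaved. The test is based on the OLS residual
vector $e$, and is essentially a test of the joint significance of the first $p$
autocorrelations of these residuals. A remarkable feature of the test is that the same test
statistic applies for either alternative hypothesis. The components of the test
statistic are

$e = y - Xb$, the usual $n \times 1$ vector of OLS residuals

$\hat\sigma^2 = e'e/n$, the ML estimate of $\sigma^2$

$E_p$, the $n \times p$ matrix whose $j$th column is $e$ lagged $j$ periods, with presample
values set at zero:

$$E_p = \begin{bmatrix}
0 & 0 & \cdots & 0 \\
e_1 & 0 & \cdots & 0 \\
e_2 & e_1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
e_{n-1} & e_{n-2} & \cdots & e_{n-p}
\end{bmatrix}$$

The test statistic is

$$l = e'E_p\,[E_p'E_p - E_p'X(X'X)^{-1}X'E_p]^{-1}E_p'e/\hat\sigma^2 \tag{8-70}$$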


† L. G. Godfrey, “Testing Against General Autoregressive and Moving Average Error Models
when the Regressors Include Lagged Dependent Variables,” Econometrica, vol. 46, 1978, pp.
1293-1302; and T. S. Breusch, “Testing for Autocorrelation in Dynamic Linear Models,” Australian
Economic Papers, vol. 17, 1978, pp. 334-355.
which, under the null hypothesis, is asymptotically distributed as $\chi^2(p)$. The
asymptotic properties of the test would not be affected if $\hat\sigma^2$ were replaced by the
usual unbiased estimator $s^2 = e'e/(n - k)$. Significantly large values of $l$ would
lead to the rejection of the null hypothesis, but would not indicate which of the
alternative hypotheses, Eq. (8-68) or Eq. (8-69), should be regarded as the more
appropriate. Godfrey shows that when the $X$ matrix contains only exogenous
variables, the test statistic is asymptotically equivalent to

$$n(r_1^2 + r_2^2 + \cdots + r_p^2)$$

where

$$r_j = \frac{\sum_{t=j+1}^{n} e_t e_{t-j}}{\sum_{t=1}^{n} e_t^2}$$

is the $j$th autocorrelation coefficient of the OLS residuals.

The test statistic defined in Eq. (8-70) may seem to imply a burdensome
amount of computation, but it can be expressed in a simpler form. Suppose that
the OLS regression of $y$ on $X$ has been computed. The $X$ matrix may contain
lagged values of the dependent variable. Denoting the residual vector from this
regression by $e = y - Xb$, it follows that

$\{e_t\}$ has zero mean
and $e'X = 0$

Suppose now that $e$ is regressed on the matrix $[E_p \;\; X]$, where $E_p$ is as defined
above. The $R^2$ from this regression is given by

$$R^2 = \frac{\text{ESS}}{e'e}$$

since no correction is required for the mean of the dependent variable. Moreover,
from Eq. (5-53),

$$\text{ESS} = e'[E_p \;\; X]
\begin{bmatrix} E_p'E_p & E_p'X \\ X'E_p & X'X \end{bmatrix}^{-1}
\begin{bmatrix} E_p' \\ X' \end{bmatrix} e$$

Using $e'X = 0$, this simplifies to

$$\text{ESS} = e'E_p\,[E_p'E_p - E_p'X(X'X)^{-1}X'E_p]^{-1}E_p'e$$

Thus

$$l = \frac{\text{ESS}}{\hat\sigma^2}$$

and since $\hat\sigma^2 = e'e/n$, we have

$$l = nR^2 \tag{8-71}$$

The test procedure is thus as follows.

1. Fit the OLS regression of $y$ on $X$ to obtain $e$.
2. Regress $e$ on $[E_p \;\; X]$ and obtain $R^2$.
3. Refer $nR^2$ to the $\chi^2(p)$ distribution and reject the null hypothesis if a
   significantly large value is found.

In practice the second step in the test procedure might be described as

2. Regress $e_t$ on $e_{t-1}, \ldots, e_{t-p}$, and $x_t$ (that is, the $t$th row of $X$).

Since there are only $n$ values of $e$ available, this regression might be carried
out using only the last $n - p$ observations. The Breusch-Godfrey procedure sets
$e_0, e_{-1}, \ldots$ at zero. Asymptotically it does not matter which route is taken, and it
is a moot point whether it matters in finite samples.
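
The equivalence between the direct statistic of Eq. (8-70) and the $nR^2$ form of Eq. (8-71) can be checked numerically. The sketch below uses simulated data with arbitrarily chosen parameter values; the helper names `lag_matrix` and `breusch_godfrey` are hypothetical, and presample residuals are set at zero as in the Breusch-Godfrey procedure.

```python
import numpy as np

def lag_matrix(e, p):
    """Build E_p: column j holds e lagged j periods, presample values zero."""
    n = e.shape[0]
    E = np.zeros((n, p))
    for j in range(1, p + 1):
        E[j:, j - 1] = e[: n - j]
    return E

def breusch_godfrey(y, X, p):
    """Return the statistic l computed two ways: n*R^2 and Eq. (8-70)."""
    n = y.shape[0]
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                        # OLS residuals, so e'X = 0
    E = lag_matrix(e, p)
    Z = np.hstack([E, X])                # auxiliary regressors [E_p  X]
    g, *_ = np.linalg.lstsq(Z, e, rcond=None)
    fitted = Z @ g
    r2 = (fitted @ fitted) / (e @ e)     # uncentered R^2 = ESS / e'e
    l_aux = n * r2
    sigma2_hat = (e @ e) / n             # ML estimate of sigma^2
    M = E.T @ E - E.T @ X @ np.linalg.solve(X.T @ X, X.T @ E)
    Ee = E.T @ e
    l_direct = (Ee @ np.linalg.solve(M, Ee)) / sigma2_hat
    return l_aux, l_direct

# Simulated example (illustrative values only): AR(1) disturbance, rho = 0.6
rng = np.random.default_rng(0)
n, rho = 200, 0.6
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])

l_aux, l_direct = breusch_godfrey(y, X, p=2)
# Compare l_aux with the chi-square(2) 5 percent critical value, 5.99.
```

The two numbers agree up to rounding error, which is the content of Eq. (8-71); only the auxiliary regression is needed in practice.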


Estimation with Autocorrelated Disturbances 


The Durbin test for a regression and the Breusch-Godfrey test for the presence of 
autocorrelated errors are applicable when lagged values of the dependent variable 
appear in the X matrix. In discussing estimation procedures we will restrict 
consideration in this section to nonstochastic X matrices. Additional remarks on 
estimation in the presence of lagged dependent variables will be made in Sec. 9-2. 
Consider again the model

$$y = X\beta + u$$

with $E(u) = 0$ and $E(uu') = \sigma^2\Omega$.
From the discussion in Sec. 8-3 it is clear that GLS estimation may be achieved if
it is possible to find a transformation matrix $T$ of known parameters such that
$T'T = \Omega^{-1}$ and then apply OLS to the transformed variables $Ty$ and $TX$. As an
illustration, suppose we have just a two-variable regression where the disturbance
follows an AR(1) scheme, that is,

$$Y_t = \alpha + \beta X_t + u_t \qquad t = 1, \ldots, n \tag{8-72}$$

and $u_t = \rho u_{t-1} + \varepsilon_t$
with $|\rho| < 1$ and well-behaved $\varepsilon$'s. The form of $\Omega$ for this model has already been
given in Eq. (8-11) and its inverse in Eq. (8-55) as


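$$\Omega^{-1} = \frac{1}{1-\rho^2}\begin{bmatrix}
1 & -\rho & 0 & \cdots & 0 & 0 & 0 \\
-\rho & 1+\rho^2 & -\rho & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -\rho & 1+\rho^2 & -\rho \\
0 & 0 & 0 & \cdots & 0 & -\rho & 1
\end{bmatrix}$$

Consider first an $(n - 1) \times n$ transformation matrix $T_1$, defined by

$$T_1 = \begin{bmatrix}
-\rho & 1 & 0 & \cdots & 0 & 0 \\
0 & -\rho & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{bmatrix}$$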


Multiplication then shows that $T_1'T_1$ gives an $n \times n$ matrix which, apart from a
proportionality constant, is identical with $\Omega^{-1}$ except for the first element in the
leading diagonal, which is $\rho^2$ rather than unity. Now consider the $n \times n$ matrix $T$
obtained from $T_1$ by adding a new first row with $\sqrt{1 - \rho^2}$ in the first position and
zeros elsewhere, that is,

$$T = \begin{bmatrix}
\sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 & 0 \\
-\rho & 1 & 0 & \cdots & 0 & 0 \\
0 & -\rho & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{bmatrix}$$


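Multiplication shows that $T'T = (1 - \rho^2)\Omega^{-1}$. The difference between $T_1$ and $T$
lies only in the treatment of the first sample observation. Applying $T_1$ to Eq.
(8-72) gives the transformed model

$$\begin{bmatrix} Y_2 - \rho Y_1 \\ Y_3 - \rho Y_2 \\ \vdots \\ Y_n - \rho Y_{n-1} \end{bmatrix}
= \begin{bmatrix} 1-\rho & X_2 - \rho X_1 \\ 1-\rho & X_3 - \rho X_2 \\ \vdots & \vdots \\ 1-\rho & X_n - \rho X_{n-1} \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
+ \begin{bmatrix} \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix} \tag{8-73}$$

so that only $n - 1$ transformed observations are used in the OLS estimation. The
variables in Eq. (8-73) are sometimes referred to as quasi first differences, and
the intercept term being estimated is now $\alpha(1 - \rho)$. The variance matrix of the
disturbance term in Eq. (8-73) is an $(n - 1) \times (n - 1)$ matrix,

$$\operatorname{var}(\varepsilon) = (1 - \rho^2)\sigma_u^2 I$$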


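Application of $T$ to Eq. (8-72) gives the transformed model

$$\begin{bmatrix} \sqrt{1-\rho^2}\,Y_1 \\ Y_2 - \rho Y_1 \\ \vdots \\ Y_n - \rho Y_{n-1} \end{bmatrix}
= \begin{bmatrix} \sqrt{1-\rho^2} & \sqrt{1-\rho^2}\,X_1 \\ 1-\rho & X_2 - \rho X_1 \\ \vdots & \vdots \\ 1-\rho & X_n - \rho X_{n-1} \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
+ \begin{bmatrix} \sqrt{1-\rho^2}\,u_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \tag{8-74}$$

The $\sqrt{1 - \rho^2}$ factor is required to make the transformed disturbances
homoscedastic. From Eq. (8-9),

$$\sigma_\varepsilon^2 = \sigma_u^2(1 - \rho^2)$$

for an AR(1) scheme. Thus $\operatorname{var}[\sqrt{1-\rho^2}\,u_1] = \sigma_\varepsilon^2$.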
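
The two matrix identities used above can be verified numerically. The sketch below assumes that $\Omega$ is the AR(1) correlation matrix with typical element $\rho^{|i-j|}$, which is the form consistent with the inverse displayed earlier; the values of `n` and `rho` are arbitrary.

```python
import numpy as np

n, rho = 6, 0.7   # illustrative values only

# Omega: AR(1) correlation matrix with (i, j) element rho**|i - j|
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# T1: (n-1) x n matrix forming the quasi first differences
T1 = np.zeros((n - 1, n))
for t in range(n - 1):
    T1[t, t] = -rho
    T1[t, t + 1] = 1.0

# T: T1 with a new first row carrying sqrt(1 - rho^2) in the first position
first_row = np.zeros((1, n))
first_row[0, 0] = np.sqrt(1.0 - rho**2)
T = np.vstack([first_row, T1])

target = (1.0 - rho**2) * np.linalg.inv(Omega)

full_ok = np.allclose(T.T @ T, target)   # T'T = (1 - rho^2) Omega^{-1}
diff = T1.T @ T1 - target                # differs only in the (1, 1) element
```

Here `diff` is zero everywhere except its first diagonal element, where $T_1'T_1$ has $\rho^2$ in place of unity.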

If $\rho$ were known, GLS estimation could be achieved by applying OLS to Eq.
(8-74), or the process could be approximated by using Eq. (8-73). The difference
between the two procedures can be important when the sample size is small. The
extensions to include additional explanatory variables and higher-order AR
processes are simple. Additional $X$'s are treated in exactly the same way as the
single explanatory variable in the example. If the disturbance term followed an
AR(2) scheme,

$$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \varepsilon_t$$

the transformed variable would take the form

$$Y_t - \phi_1 Y_{t-1} - \phi_2 Y_{t-2} \qquad t = 3, \ldots, n$$

Special transformations would also be required for the first two observations.†
The assumption, however, of a known value for $\rho$ is unrealistic. It is a
parameter to be estimated along with $\alpha$, $\beta$, and $\sigma^2$. Lagging Eq. (8-72) one period,
multiplying by $\rho$, and subtracting from Eq. (8-72) gives

$$Y_t = \alpha(1 - \rho) + \beta X_t - \beta\rho X_{t-1} + \rho Y_{t-1} + \varepsilon_t \qquad t = 2, \ldots, n \tag{8-75}$$

The disturbance in Eq. (8-75) satisfies the assumptions required for OLS. However,

$$\sum \varepsilon_t^2 = f(\alpha, \beta, \rho) \tag{8-76}$$

which is a function of just three unknown parameters, while a straightforward
application of OLS to Eq. (8-75) will yield four estimated coefficients, namely,

$$b_1 = \widehat{\alpha(1 - \rho)} \qquad b_2 = \hat\beta \qquad b_3 = -\widehat{\beta\rho} \qquad b_4 = \hat\rho$$

where the circumflex denotes an estimate of the corresponding parameter. These
four equations will, in general, be inconsistent in that they do not yield a unique
set of estimates of $\alpha$, $\beta$, and $\rho$. Thus a nonlinear constraint, $b_3 = -b_2 b_4$, would
have to be imposed on the estimation process. To put the same point another
way, minimizing $\sum \varepsilon_t^2$ with respect to $\alpha$, $\beta$, and $\rho$ gives equations which are
nonlinear in the parameters and thus cannot be solved analytically.

The basis of an iterative estimation process can be found by rewriting Eq.
(8-75) in two equivalent fashions, namely:

1. $Y_t - \rho Y_{t-1} = \alpha(1 - \rho) + \beta(X_t - \rho X_{t-1}) + \varepsilon_t$
and
2. $(Y_t - \alpha - \beta X_t) = \rho(Y_{t-1} - \alpha - \beta X_{t-1}) + \varepsilon_t$

Starting with any value for $\rho$, the quasi first differences in the equation of step 1
could be computed, and OLS applied to it would then yield estimates of $\alpha$ and $\beta$.
These estimates in turn can be used to compute the $Y_t - \alpha - \beta X_t$ series. Regressing
this series on itself lagged one period in the equation of step 2 yields a revised
estimate of $\rho$, which can then be fed back into the equation of step 1, and the
process continues.

This is known as the Cochrane-Orcutt iterative process, and versions of it are 
incorporated in almost all social science computer packages. There is a variety of 


† For details see F. B. Lempers and T. Kloek, “On a Simple Transformation for Second-Order
Autocorrelated Disturbances in Regression Analysis,” Statistica Neerlandica, vol. 27, 1973, pp. 69-75.

‡ D. Cochrane and G. H. Orcutt, “Application of Least Squares Regressions to Relationships
Containing Autocorrelated Error Terms,” Journal of the American Statistical Association, vol. 44, 1949,
pp. 32-61.
starting positions and rules for termination. If the initial value of $\rho$ is set at zero,
step 1 is then simply the OLS regression of $Y_t$ on $X_t$, which yields the OLS
residuals $e_t = Y_t - a - bX_t$. In step 2 $e_t$ is regressed on $e_{t-1}$, without an intercept
term, to obtain an estimate $r$ of the first-order autocorrelation coefficient.
Alternatively $r$ may be computed from the Durbin-Watson statistic, which is a routine
output in an OLS package, as

$$r = 1 - \tfrac{1}{2}d$$

The estimated $r$ is then used to compute the series $(Y_t - rY_{t-1})$ and $(X_t - rX_{t-1})$,
which are used in a repeat of step 1. The process may be stopped any time the
Durbin-Watson statistic in step 1 indicates random residuals. This frequently
occurs after one complete iteration. Alternatively one can stop the process after
successive estimates of the parameters differ by less than some prescribed amount.
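
The iteration between steps 1 and 2 can be sketched as follows. This is a minimal illustration on simulated data, not the routine of any particular package; the function name `cochrane_orcutt`, the tolerance, and the simulation values are assumptions. Note that with a starting value of zero this sketch still drops the first observation in step 1, a small departure from the full-sample OLS start described above.

```python
import numpy as np

def cochrane_orcutt(y, x, tol=1e-8, max_iter=100):
    """Iterate steps 1 and 2 until successive rho estimates differ by < tol."""
    rho = 0.0
    alpha = beta = 0.0
    for _ in range(max_iter):
        # Step 1: OLS on quasi first differences; the column (1 - rho)
        # makes the first coefficient alpha itself.
        ys = y[1:] - rho * y[:-1]
        xs = x[1:] - rho * x[:-1]
        Z = np.column_stack([np.full(ys.shape[0], 1.0 - rho), xs])
        (alpha, beta), *_ = np.linalg.lstsq(Z, ys, rcond=None)
        # Step 2: regress u_t on u_{t-1} without an intercept.
        u = y - alpha - beta * x
        rho_new = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return alpha, beta, rho

# Simulated data (illustrative): alpha = 1, beta = 2, rho = 0.7
rng = np.random.default_rng(0)
n, true_rho = 500, 0.7
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
u[0] = eps[0] / np.sqrt(1.0 - true_rho**2)
for t in range(1, n):
    u[t] = true_rho * u[t - 1] + eps[t]
y = 1.0 + 2.0 * x + u

alpha_hat, beta_hat, rho_hat = cochrane_orcutt(y, x)
```

The stopping rule used here is the second one mentioned in the text, a prescribed tolerance on successive estimates of $\rho$.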

It is clear that step 1 in the Cochrane-Orcutt process is the use of model
(8-73) and the associated transformation matrix $T_1$. A modification of the process
is to use model (8-74), where the first term gets explicit treatment. This is often
referred to as the Prais-Winsten method.† The Prais-Winsten modification may be
expected to improve the efficiency of the estimation, especially in small sample
sizes. Yet another modification is to use a method suggested by Durbin for
obtaining the initial estimate of $\rho$.‡ This is to fit Eq. (8-75) by OLS without
worrying about the nonlinear restriction and take $r$ as the coefficient of $Y_{t-1}$. A
Monte Carlo study by Griliches and Rao suggests that a two-step estimator
consisting of the Durbin estimate of $\rho$ followed by the Prais-Winsten treatment of
the transformed variables performs somewhat better than any of the other
variants over a fairly wide range of parameter values.§

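The two-step Durbin estimator extends easily to more than one explanatory
variable and to higher order autoregressive schemes. For example, suppose the
model is

$$Y_t = \beta_1 + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} + u_t$$

with $u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \varepsilon_t$.
Combining the two equations gives

$$\begin{aligned}
Y_t = {} & \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \beta_2 X_{2t} + \cdots + \beta_k X_{kt}
 - \phi_1\beta_2 X_{2,t-1} - \cdots - \phi_1\beta_k X_{k,t-1} \\
 & - \phi_2\beta_2 X_{2,t-2} - \cdots - \phi_2\beta_k X_{k,t-2} + (1 - \phi_1 - \phi_2)\beta_1 + \varepsilon_t
\end{aligned}$$

Let $\hat\phi_1$ and $\hat\phi_2$ denote the coefficients of $Y_{t-1}$ and $Y_{t-2}$ when this regression is
fitted by OLS. The transformed variables are then computed as

$$(Y_t - \hat\phi_1 Y_{t-1} - \hat\phi_2 Y_{t-2}), \quad (X_{it} - \hat\phi_1 X_{i,t-1} - \hat\phi_2 X_{i,t-2}) \qquad i = 2, \ldots, k; \ t = 3, \ldots, n$$

and OLS is applied to these to obtain estimates of the $\beta$'s.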


† S. J. Prais and C. B. Winsten, “Trend Estimators and Serial Correlation,” Cowles Commission
Discussion Paper, no. 383, Chicago, 1954.

‡ J. Durbin, “Estimation of Parameters in Time Series Regression Models,” Journal of the Royal
Statistical Society, ser. B, vol. 22, 1960, pp. 139-153.

§ Z. Griliches and P. Rao, “Small Sample Properties of Several Two Stage Regression Methods in
the Context of Autocorrelated Errors,” Journal of the American Statistical Association, vol. 64, 1969,
pp. 253-272.
The iterative procedures described above will, in general, converge to a
solution vector since at each stage one is minimizing a quadratic function in the
unknowns. There remains the question of whether one has reached a local or a
global minimum of the sum of squares function. This may be investigated by
performing a grid search over the permissible range of $\rho$ values. For an AR(1)
scheme stability requires $|\rho| < 1$. Thus a grid of $\rho$ values might be specified
ranging from $-1$ to $+1$ by increments of 0.1. For each value step 1 of the
Cochrane-Orcutt process is applied and the residual sum of squares computed.
The $\rho$ value and the associated $\alpha$ and $\beta$ values with the minimum residual sum of
squares are then chosen as the estimates. Or a finer grid may be imposed around
this $\rho$ value and a further grid search carried out to achieve greater precision.
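
A grid search of this kind can be sketched as below, again on simulated data with arbitrary parameter values. The helper name `step1_rss` is hypothetical, and the grid is taken over $-0.9, \ldots, 0.9$ because the endpoints $\pm 1$ violate the stability condition (at $\rho = 1$ the transformed intercept column vanishes).

```python
import numpy as np

def step1_rss(y, x, rho):
    """Apply step 1 for a given rho; return (RSS, alpha, beta)."""
    ys = y[1:] - rho * y[:-1]
    xs = x[1:] - rho * x[:-1]
    Z = np.column_stack([np.full(ys.shape[0], 1.0 - rho), xs])
    coef, *_ = np.linalg.lstsq(Z, ys, rcond=None)
    resid = ys - Z @ coef
    return resid @ resid, coef[0], coef[1]

# Simulated data (illustrative): alpha = 1, beta = 2, rho = 0.7
rng = np.random.default_rng(0)
n, true_rho = 500, 0.7
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
u[0] = eps[0] / np.sqrt(1.0 - true_rho**2)
for t in range(1, n):
    u[t] = true_rho * u[t - 1] + eps[t]
y = 1.0 + 2.0 * x + u

# Coarse grid in steps of 0.1, then a finer grid around the coarse minimum
coarse = np.linspace(-0.9, 0.9, 19)
rss_coarse = [step1_rss(y, x, r)[0] for r in coarse]
best_coarse = coarse[int(np.argmin(rss_coarse))]

fine = np.linspace(best_coarse - 0.09, best_coarse + 0.09, 19)
fine = fine[np.abs(fine) < 1.0]
rss_fine = [step1_rss(y, x, r)[0] for r in fine]
best_rho = fine[int(np.argmin(rss_fine))]
best_rss, best_alpha, best_beta = step1_rss(y, x, best_rho)
```

Plotting `rss_coarse` against `coarse` is a simple way to see whether the sum of squares function has more than one minimum.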

When some parameters of the variance matrix have to be estimated, we have 
a further example of feasible GLS estimation. Thus our conventional test proce- 
dures no longer have an exact finite sample justification but are only justified 
asymptotically. The tests should be based on the final least-squares regression 
computed in either a two-step or an iterative procedure, as the usual standard 
error formulas will, in general, yield consistent estimates of the asymptotic errors. 
The justification for this remark is provided at the end of the next section on ML 
estimation. 


Maximum Likelihood Estimation 


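A full ML procedure for a regression equation with an AR(1) disturbance has
recently been proposed,† and the algorithm has been incorporated in some
computer packages. The model considered is

$$y = X\beta + u$$

with

$$u_t = \rho u_{t-1} + \varepsilon_t \qquad E(\varepsilon) = 0 \qquad E(\varepsilon\varepsilon') = \sigma_\varepsilon^2 I$$

The likelihood function is then

$$L(\varepsilon) = \frac{1}{(2\pi\sigma_\varepsilon^2)^{n/2}} \exp\left(-\frac{\varepsilon'\varepsilon}{2\sigma_\varepsilon^2}\right)$$

Using the transformation matrix $T$ defined above, we have

$$Tu = \varepsilon$$

where $u$ and $\varepsilon$ are both $n \times 1$ vectors. Changing variables in the likelihood
function gives‡

$$L(u) = L(\varepsilon)\left|\frac{\partial\varepsilon}{\partial u}\right|$$

where $|\partial\varepsilon/\partial u|$ indicates the absolute value of the determinant formed from the
matrix of partial derivatives of the $\varepsilon$'s with respect to the $u$'s. In this case

$$\left|\frac{\partial\varepsilon}{\partial u}\right| = |\det(T)| = (1 - \rho^2)^{1/2}$$

† C. M. Beach and J. G. MacKinnon, “A Maximum Likelihood Procedure for Regression with
Autocorrelated Errors,” Econometrica, vol. 46, 1978, pp. 51-58.
‡ See App. A-9, Change of Variables in Density Functions.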
and so

$$L(u) \propto \frac{(1 - \rho^2)^{1/2}}{\sigma_\varepsilon^{n}} \exp\left(-\frac{\varepsilon'\varepsilon}{2\sigma_\varepsilon^2}\right) \tag{8-77}$$

where $\propto$ means "is proportional to."
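
The role of the $(1 - \rho^2)^{1/2}$ term can be illustrated by concentrating Eq. (8-77) with respect to $\beta$ and $\sigma_\varepsilon^2$ and searching over $\rho$. The sketch below is a plain grid search on simulated data, not the Beach-MacKinnon algorithm; the function names and parameter values are assumptions. For given $\rho$, the transformed regression (8-74) gives the sum of squares $S(\rho)$, and the concentrated log likelihood is, up to a constant, $\tfrac{1}{2}\ln(1-\rho^2) - \tfrac{n}{2}\ln[S(\rho)/n]$.

```python
import numpy as np

def pw_ssr(y, x, rho):
    """Sum of squared residuals from OLS on the model transformed by T."""
    n = y.shape[0]
    c = np.sqrt(1.0 - rho**2)
    ys = np.empty(n)
    xs = np.empty(n)
    const = np.empty(n)
    ys[0], xs[0], const[0] = c * y[0], c * x[0], c   # first observation
    ys[1:] = y[1:] - rho * y[:-1]                    # quasi first differences
    xs[1:] = x[1:] - rho * x[:-1]
    const[1:] = 1.0 - rho
    Z = np.column_stack([const, xs])
    coef, *_ = np.linalg.lstsq(Z, ys, rcond=None)
    resid = ys - Z @ coef
    return resid @ resid

def concentrated_loglik(y, x, rho):
    """Log of Eq. (8-77) maximized over beta and sigma^2, constants dropped."""
    n = y.shape[0]
    return 0.5 * np.log(1.0 - rho**2) - 0.5 * n * np.log(pw_ssr(y, x, rho) / n)

# Simulated data (illustrative): alpha = 1, beta = 2, rho = 0.7
rng = np.random.default_rng(0)
n, true_rho = 200, 0.7
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
u[0] = eps[0] / np.sqrt(1.0 - true_rho**2)
for t in range(1, n):
    u[t] = true_rho * u[t - 1] + eps[t]
y = 1.0 + 2.0 * x + u

grid = np.linspace(-0.99, 0.99, 199)
loglik = np.array([concentrated_loglik(y, x, r) for r in grid])
ssr = np.array([pw_ssr(y, x, r) for r in grid])
rho_ml = grid[int(np.argmax(loglik))]   # maximizes the full likelihood
rho_ls = grid[int(np.argmin(ssr))]      # minimizes the sum of squares only
```

The two grid solutions `rho_ml` and `rho_ls` need not coincide, which is the sense in which minimizing the sum of squares alone falls short of full ML.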

Thus a procedure which minimizes $\varepsilon'\varepsilon$, even one that incorporates the
Prais-Winsten treatment of the first observation, is not full ML since it has not
taken account of the term in $1 - \rho^2$ in the likelihood function. Beach and
MacKinnon devised an iterative procedure for maximizing Eq. (8-77), which has
now been incorporated in White’s SHAZAM program and also in new versions of
the time-series processor (TSP).† They also conducted some sampling experiments
which suggest that their procedure may yield better estimates than conventional
procedures, such as the Cochrane-Orcutt process, and may also be
computationally less expensive. Some further experiments conducted by Harvey and
McAvinchey compare the full ML procedure not just with the two-step
Cochrane-Orcutt process, which was the comparison in the Beach-MacKinnon
study, but also with the iterative Cochrane-Orcutt and with the two-step
Prais-Winsten procedures.‡ Their study uses the root-mean-square error (RMSE) of
estimators as the principle of comparison and confirms the results of the
Beach-MacKinnon experiments, but it also brings out a number of important points.


1. The two-step Prais-Winsten method is as efficient as full ML estimation for 
the parameter values underlying their experiments. 

2. The iterative Cochrane-Orcutt process is sometimes inferior to two-step 
Cochrane-Orcutt, especially when the explanatory variable is basically a time 
trend. 

3. The two-step Cochrane-Orcutt process is in turn inferior to the two-step 
Prais-Winsten method when the explanatory variable is trending. 

4. OLS has RMSEs only about 3 to 4 percent in excess of full ML estimation
with trending data, but its relative performance deteriorates when X is a
stationary random series.


A more recent study by Park and Mitchell confirms the main findings of 
Harvey and McAvinchey and adds some additional findings.§ 


1. Their range of estimators includes an iterative version of Prais-Winsten, with 
p estimated from the least-squares residuals, and they find this to be the best 
of the feasible estimators. 


† K. J. White, “A General Computer Program for Econometric Methods—SHAZAM,”
Econometrica, vol. 46, 1978, pp. 239-240.

‡ A. C. Harvey and I. D. McAvinchey, “The Small Sample Efficiency of Two-Step Estimates in
Regression Models with Autoregressive Disturbances,” Discussion Paper no. 78-10, University of
British Columbia, April 1978. See also A. C. Harvey, The Econometric Analysis of Time Series, Wiley,
New York, 1981, pp. 196-199.

§ R. E. Park and B. M. Mitchell, “Estimating the Autocorrelated Error Model with Trended
Data,” Journal of Econometrics, vol. 13, 1980, pp. 185-201.
2. They also investigate how well the various estimators perform in hypothesis
testing by looking at the number of type I errors in 1000 trials at the 0.05 
significance level. The results are only reported for positively autocorrelated 
disturbances, but the message is very clear. All estimators seriously under- 
estimate standard errors, making estimated coefficients appear to be much 
more significant than they actually are. This is, of course, to be expected for 
OLS, but it is also fairly substantial for two-stage Prais-Winsten (2SPW), 
iterative Prais-Winsten (ITERPW), and Beach-MacKinnon ML (BM). For a 
sample size of 20, p = 0.8, and GNP as the trending explanatory variable, the 
number of type I errors reported are OLS (449), 2SPW (251), ITERPW (246), 
and BM (258). These numbers should be contrasted with an expected range 
of 37 to 63. Thus it would be advisable to apply more stringent significance 
levels than usual in testing coefficients in models with autocorrelated dis- 
turbances. 


Beach and MacKinnon have extended their ML approach to accommodate
an AR(2) process.† The treatment of relationships where the disturbance term
follows an MA or ARMA process is less well developed than the AR case.‡

As an illustration of the derivation of the asymptotic errors for ML estimation
of a relationship with an AR(1) disturbance, consider the simple model

$$Y_t = \alpha + \beta X_t + u_t \qquad t = 1, \ldots, n$$

with $u_t = \rho u_{t-1} + \varepsilon_t$, $|\rho| < 1$
and $\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$


We need to evaluate the information matrix. We are only concerned with
asymptotic results, and so the treatment of the first sample observation does not
matter since its effect becomes negligible as the sample size increases. Thus we
may write the model as

$$\varepsilon_t = u_t - \rho u_{t-1} = (Y_t - \alpha - \beta X_t) - \rho(Y_{t-1} - \alpha - \beta X_{t-1}) \qquad t = 2, \ldots, n$$

The log likelihood function is then

$$\ln L = -\frac{n-1}{2}\ln(2\pi) - \frac{n-1}{2}\ln\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}\sum \varepsilon_t^2$$

The first-order partial derivatives with respect to the unknown parameters $\alpha$, $\beta$, $\rho$,


† C. M. Beach and J. G. MacKinnon, “Full Maximum Likelihood Estimation of Second-Order
Autoregressive Error Models,” Journal of Econometrics, vol. 7, 1978, pp. 187-198.

‡ For an account of recent developments see A. C. Harvey, The Econometric Analysis of Time
Series, Wiley, New York, 1981, Chap. 6. See also A. C. Harvey and I. D. McAvinchey, “On the
Relative Efficiency of Various Estimators of Regression Models with Moving Average Disturbances,”
in E. G. Charatsis, Ed., Proceedings of the Econometric Society European Meeting, Athens, 1979,
North-Holland, Amsterdam, 1981, pp. 105-118.
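and $\sigma_\varepsilon^2$ are

$$\frac{\partial \ln L}{\partial \alpha} = \frac{1-\rho}{\sigma_\varepsilon^2}\sum \varepsilon_t$$

$$\frac{\partial \ln L}{\partial \beta} = \frac{1}{\sigma_\varepsilon^2}\sum \varepsilon_t (X_t - \rho X_{t-1})$$

$$\frac{\partial \ln L}{\partial \rho} = \frac{1}{\sigma_\varepsilon^2}\sum \varepsilon_t u_{t-1}$$

$$\frac{\partial \ln L}{\partial \sigma_\varepsilon^2} = -\frac{n-1}{2\sigma_\varepsilon^2} + \frac{1}{2\sigma_\varepsilon^4}\sum \varepsilon_t^2$$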

where all summations are over $t = 2, \ldots, n$. Setting these derivatives to zero and
solving for the parameters gives conditional ML estimators (conditional, that is,
on X, which is taken as fixed). We may note in passing that the first three
equations give

$$\sum (Y_t - \hat\rho Y_{t-1}) = (n-1)\hat\alpha(1 - \hat\rho) + \hat\beta \sum (X_t - \hat\rho X_{t-1})$$

$$\sum (Y_t - \hat\rho Y_{t-1})(X_t - \hat\rho X_{t-1}) = \hat\alpha(1 - \hat\rho)\sum (X_t - \hat\rho X_{t-1}) + \hat\beta \sum (X_t - \hat\rho X_{t-1})^2$$

$$\hat\rho = \frac{\sum (Y_t - \hat\alpha - \hat\beta X_t)(Y_{t-1} - \hat\alpha - \hat\beta X_{t-1})}{\sum (Y_{t-1} - \hat\alpha - \hat\beta X_{t-1})^2}$$

which are the equations of the iterative Cochrane-Orcutt process, the first two
being the least-squares equations on the quasi first differences and the third the
first-order autoregressive coefficient of the estimated residuals.

Turning to the second-order partial derivatives


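$$\frac{\partial^2 \ln L}{\partial \alpha^2} = -\frac{(n-1)(1-\rho)^2}{\sigma_\varepsilon^2}$$

$$\frac{\partial^2 \ln L}{\partial \beta^2} = -\frac{1}{\sigma_\varepsilon^2}\sum (X_t - \rho X_{t-1})^2$$

$$\frac{\partial^2 \ln L}{\partial \rho^2} = -\frac{1}{\sigma_\varepsilon^2}\sum u_{t-1}^2$$

$$\frac{\partial^2 \ln L}{\partial (\sigma_\varepsilon^2)^2} = \frac{n-1}{2\sigma_\varepsilon^4} - \frac{1}{\sigma_\varepsilon^6}\sum \varepsilon_t^2$$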

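The cross partial derivatives are

$$\frac{\partial^2 \ln L}{\partial\alpha\,\partial\beta} = -\frac{1-\rho}{\sigma_\varepsilon^2}\sum (X_t - \rho X_{t-1})$$

$$\frac{\partial^2 \ln L}{\partial\alpha\,\partial\rho} = -\frac{1}{\sigma_\varepsilon^2}\left\{\sum \varepsilon_t + (1-\rho)\sum u_{t-1}\right\}$$

$$\frac{\partial^2 \ln L}{\partial\alpha\,\partial\sigma_\varepsilon^2} = -\frac{1-\rho}{\sigma_\varepsilon^4}\sum \varepsilon_t$$

$$\frac{\partial^2 \ln L}{\partial\beta\,\partial\rho} = -\frac{1}{\sigma_\varepsilon^2}\left\{\sum \varepsilon_t X_{t-1} + \sum u_{t-1}(X_t - \rho X_{t-1})\right\}$$

$$\frac{\partial^2 \ln L}{\partial\beta\,\partial\sigma_\varepsilon^2} = -\frac{1}{\sigma_\varepsilon^4}\sum \varepsilon_t (X_t - \rho X_{t-1})$$

$$\frac{\partial^2 \ln L}{\partial\rho\,\partial\sigma_\varepsilon^2} = -\frac{1}{\sigma_\varepsilon^4}\sum \varepsilon_t u_{t-1}$$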
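Taking the negative of the expectations gives the information matrix

$$R = \frac{1}{\sigma_\varepsilon^2}\begin{bmatrix}
(n-1)(1-\rho)^2 & (1-\rho)\sum (X_t - \rho X_{t-1}) & 0 & 0 \\
(1-\rho)\sum (X_t - \rho X_{t-1}) & \sum (X_t - \rho X_{t-1})^2 & 0 & 0 \\
0 & 0 & (n-1)\sigma_u^2 & 0 \\
0 & 0 & 0 & \dfrac{n-1}{2\sigma_\varepsilon^2}
\end{bmatrix}$$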


The crucial feature of this information matrix is its block-diagonal nature.
Asymptotically the estimates of the regression parameters $\alpha$ and $\beta$ are distributed
independently of the estimate of the autocorrelation parameter $\rho$ and of the
estimate of $\sigma_\varepsilon^2$. Referring back to Eq. (8-73), the data matrix for this regression is
given by the $(n - 1) \times 2$ matrix $X_*$, where

$$X_*' = \begin{bmatrix}
1-\rho & 1-\rho & \cdots & 1-\rho \\
X_2 - \rho X_1 & X_3 - \rho X_2 & \cdots & X_n - \rho X_{n-1}
\end{bmatrix}$$

with unknown parameters $\alpha$ and $\beta$. The leading $2 \times 2$ submatrix in the bracketed
matrix of $R$ is easily seen to be $X_*'X_*$. Since the asymptotic variance matrix is given
by $R^{-1}$ and since this has the same block-diagonal form as $R$, we have

$$\text{asy var}\begin{bmatrix} \hat\alpha \\ \hat\beta \end{bmatrix} = \sigma_\varepsilon^2 (X_*'X_*)^{-1}$$

which is consistently estimated by the usual least-squares procedures, justifying
the remark at the end of the previous section. We also see from $R^{-1}$ that

$$\text{asy var}(\hat\rho) = \frac{1 - \rho^2}{n - 1}$$

remembering that $\sigma_u^2 = \sigma_\varepsilon^2/(1 - \rho^2)$.


Prediction in the Presence of Autocorrelated Disturbances 
If the model

$$Y_t = \alpha + \beta X_t + u_t \qquad u_t = \rho u_{t-1} + \varepsilon_t$$

has been estimated from $n$ sample observations, the best prediction of $Y$ in period
$n + 1$ is no longer

$$\hat Y_{n+1} = a + bX_{n+1}$$

where $a$ and $b$ are estimates obtained by any of the above methods, since this
prediction sets the disturbance term at zero, and the AR(1) process implies
$E(u_{n+1} \mid u_n) = \rho u_n$. Both elements in $\rho u_n$ are unknown, but might be estimated by

$$r\hat u_n = r(Y_n - a - bX_n)$$

The suggested predictor is then

$$\hat Y_{n+1} = a + bX_{n+1} + r\hat u_n$$

This, in fact, would be a best linear unbiased predictor if $\rho$ were known and $r$ set
equal to $\rho$ since it is the predictor that would emerge from the relation

$$Y_t - \rho Y_{t-1} = \alpha(1 - \rho) + \beta(X_t - \rho X_{t-1}) + \varepsilon_t$$

which may be rewritten as

$$Y_t = \alpha + \beta X_t + \rho(Y_{t-1} - \alpha - \beta X_{t-1}) + \varepsilon_t$$

giving the predictor†

$$\hat Y_{n+1} = a + bX_{n+1} + \rho\hat u_n$$
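
As a small numerical illustration of the suggested predictor, the sketch below compares it with the naive prediction that sets the disturbance at zero. The function name `predict_ar1` and all numerical values are hypothetical.

```python
def predict_ar1(a, b, r, x_next, y_last, x_last):
    """One-step-ahead predictor a + b*X_{n+1} + r*(Y_n - a - b*X_n)."""
    u_last = y_last - a - b * x_last      # last-period estimated disturbance
    return a + b * x_next + r * u_last

# Hypothetical estimates and data
a, b, r = 1.0, 2.0, 0.5
naive = a + b * 3.0                       # sets the disturbance at zero
adjusted = predict_ar1(a, b, r, x_next=3.0, y_last=6.0, x_last=2.0)
```

With these values the last residual is 1.0, so the adjusted prediction exceeds the naive one by $r \times 1.0 = 0.5$.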


8-6 SETS OF EQUATIONS 


Sets of equations occur in various branches of economic theory. In the theory of
consumer behavior the decision maker faces a given money income $M$ and a set of
prices $P_1, \ldots, P_r$. The assumption of utility maximization leads to a set of demand
equations

$$Q_i = f_i(P_1, P_2, \ldots, P_r, M) \qquad i = 1, \ldots, r$$

where $Q_i$ indicates the optimal rate of consumption of the $i$th commodity. Theory
imposes various conditions on these demand equations. The assumption of a
specific form of utility function will impose yet further conditions. For example, if
one postulates an indirect addilog utility function

$$U = \sum_{i=1}^{r} a_i \left(\frac{M}{P_i}\right)^{b_i}$$

the $i$th demand equation is‡

$$Q_i = \frac{a_i b_i M^{b_i} P_i^{-b_i-1} e^{\varepsilon_i}}{\sum_j a_j b_j M^{b_j-1} P_j^{-b_j}} \qquad i = 1, \ldots, r \tag{8-78}$$

where we have inserted a multiplicative disturbance term $e^{\varepsilon_i}$ to prepare the way
for empirical estimation.§ Expenditure $Z_i = P_iQ_i$ on the $i$th commodity is then
given by

$$Z_i = \frac{a_i b_i M^{b_i} P_i^{-b_i} e^{\varepsilon_i}}{\sum_j a_j b_j M^{b_j-1} P_j^{-b_j}} \tag{8-79}$$


† For a detailed treatment of this topic see A. S. Goldberger, “Best Linear Unbiased Prediction in
the Generalized Linear Regression Model,” Journal of the American Statistical Association, vol. 57,
1962, pp. 369-375.

‡ For this result and indeed for an elegant and lucid presentation of the theory and measurement
of demand systems see L. Phlips, Applied Consumption Analysis, North-Holland, Amsterdam, 1974.

§ The e in the disturbance term indicates the mathematical constant e = 2.71828, and should not
be confused with the use of the same symbol for OLS residuals.
The expenditure equation (8-79) is nonlinear in the $a$'s and $b$'s. However, if the
logarithm of the ratio $Z_i/Z_j$ is taken,

$$\ln Z_i - \ln Z_j = A_{ij} + b_i \ln\left(\frac{M}{P_i}\right) - b_j \ln\left(\frac{M}{P_j}\right) + u_{ij} \tag{8-80}$$

where $A_{ij} = \ln\dfrac{a_i b_i}{a_j b_j}$

and $u_{ij} = \varepsilon_i - \varepsilon_j$

This is clearly an estimable equation. Given $r$ commodities, there are $r(r - 1)/2$
such equations, but most are redundant. As an illustration, for commodities $i$ and
$k$ we have

$$\ln Z_i - \ln Z_k = A_{ik} + b_i \ln\left(\frac{M}{P_i}\right) - b_k \ln\left(\frac{M}{P_k}\right) + u_{ik} \tag{8-81}$$

Subtracting Eq. (8-80) from Eq. (8-81) gives

$$\ln Z_j - \ln Z_k = A_{jk} + b_j \ln\left(\frac{M}{P_j}\right) - b_k \ln\left(\frac{M}{P_k}\right) + u_{jk}$$

where $A_{jk} = \ln\dfrac{a_j b_j}{a_k b_k}$

and $u_{jk} = \varepsilon_j - \varepsilon_k$

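Thus of the three possible equations for commodities $i$, $j$, and $k$ only two are
independent. Given any pair of equations, the third follows by subtraction. For $r$
commodities there are just $r - 1$ independent equations, and for estimation
purposes one may select any set of $r - 1$ independent equations. Thus one might
write the system

$$\begin{aligned}
\ln Z_{1t} - \ln Z_{2t} &= A_{12} + b_1 \ln\left(\frac{M_t}{P_{1t}}\right) - b_2 \ln\left(\frac{M_t}{P_{2t}}\right) + u_{1t} \\
\ln Z_{1t} - \ln Z_{3t} &= A_{13} + b_1 \ln\left(\frac{M_t}{P_{1t}}\right) - b_3 \ln\left(\frac{M_t}{P_{3t}}\right) + u_{2t} \\
&\ \ \vdots \\
\ln Z_{1t} - \ln Z_{rt} &= A_{1r} + b_1 \ln\left(\frac{M_t}{P_{1t}}\right) - b_r \ln\left(\frac{M_t}{P_{rt}}\right) + u_{r-1,t}
\end{aligned} \tag{8-82}$$

Define $Y_{1t} = \ln Z_{1t} - \ln Z_{2t}$ and $y_1 = [Y_{11} \;\; Y_{12} \;\; \cdots \;\; Y_{1n}]'$.
The sample observations on the first equation in Eqs. (8-82) may then be
written as

$$y_1 = X_1\beta_1 + u_1$$

where

$$X_1 = \begin{bmatrix}
1 & \ln\left(\dfrac{M_1}{P_{11}}\right) & -\ln\left(\dfrac{M_1}{P_{21}}\right) \\
1 & \ln\left(\dfrac{M_2}{P_{12}}\right) & -\ln\left(\dfrac{M_2}{P_{22}}\right) \\
\vdots & \vdots & \vdots \\
1 & \ln\left(\dfrac{M_n}{P_{1n}}\right) & -\ln\left(\dfrac{M_n}{P_{2n}}\right)
\end{bmatrix}
\qquad
\beta_1 = \begin{bmatrix} A_{12} \\ b_1 \\ b_2 \end{bmatrix}
\qquad
u_1 = \begin{bmatrix} \varepsilon_{11} - \varepsilon_{21} \\ \varepsilon_{12} - \varepsilon_{22} \\ \vdots \\ \varepsilon_{1n} - \varepsilon_{2n} \end{bmatrix}$$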
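Likewise, defining $Y_{2t} = \ln Z_{1t} - \ln Z_{3t}$, the sample observations on the second
equation in Eqs. (8-82) may be written as

$$y_2 = X_2\beta_2 + u_2$$

where

$$X_2 = \begin{bmatrix}
1 & \ln\left(\dfrac{M_1}{P_{11}}\right) & -\ln\left(\dfrac{M_1}{P_{31}}\right) \\
\vdots & \vdots & \vdots \\
1 & \ln\left(\dfrac{M_n}{P_{1n}}\right) & -\ln\left(\dfrac{M_n}{P_{3n}}\right)
\end{bmatrix}
\qquad
\beta_2 = \begin{bmatrix} A_{13} \\ b_1 \\ b_3 \end{bmatrix}
\qquad
u_2 = \begin{bmatrix} \varepsilon_{11} - \varepsilon_{31} \\ \vdots \\ \varepsilon_{1n} - \varepsilon_{3n} \end{bmatrix}$$

The complete model implied by Eqs. (8-82) is then

$$\begin{aligned}
y_1 &= X_1\beta_1 + u_1 \\
y_2 &= X_2\beta_2 + u_2 \\
&\ \ \vdots \\
y_m &= X_m\beta_m + u_m
\end{aligned} \tag{8-83}$$

where $m = r - 1$. This set of equations may be written equivalently

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
= \begin{bmatrix} X_1 & & & \\ & X_2 & & \\ & & \ddots & \\ & & & X_m \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix}
+ \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \tag{8-84}$$

or as

$$y = X\beta + u \tag{8-85}$$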


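Because of the block-diagonal form of $X$ the application of OLS to Eq. (8-85),
treated as a simple regression, would be exactly equivalent to the application of
OLS to each of the $m$ equations in Eqs. (8-83) separately. However, the
application of OLS to Eq. (8-85) would not be optimal for two reasons. First of all the $u$
vector is not homoscedastic. From the structure of the $u$'s

$$u_i = \varepsilon_1 - \varepsilon_{i+1} \qquad i = 1, \ldots, m$$

Thus $\operatorname{var}(u_i) = \operatorname{var}(\varepsilon_1) + \operatorname{var}(\varepsilon_{i+1}) - 2\operatorname{cov}(\varepsilon_1, \varepsilon_{i+1})$

Even if the original $\varepsilon$'s are contemporaneously uncorrelated,

$$\operatorname{var}(u_i) = \operatorname{var}(\varepsilon_1) + \operatorname{var}(\varepsilon_{i+1})$$

and $\operatorname{var}(u_j) = \operatorname{var}(\varepsilon_1) + \operatorname{var}(\varepsilon_{j+1})$

Thus the $u$'s would only be homoscedastic if the $\varepsilon$'s were homoscedastic, but
there is no a priori reason for the disturbance variances in the various expenditure
equations to be equal.

A second reason for the nonoptimality of OLS is that the off-diagonal terms
in $\operatorname{var}(u)$ will not be zero.

$$\begin{aligned}
E(u_i u_j) &= E(\varepsilon_1 - \varepsilon_{i+1})(\varepsilon_1 - \varepsilon_{j+1}) \\
&= E(\varepsilon_1^2) - E(\varepsilon_1\varepsilon_{j+1}) - E(\varepsilon_1\varepsilon_{i+1}) + E(\varepsilon_{i+1}\varepsilon_{j+1})
\end{aligned}$$

Even if the covariances of the $\varepsilon$'s vanish, $E(u_i u_j)$ does not vanish, since the term
$E(\varepsilon_1^2)$ remains. Thus $\operatorname{var}(u)$ is
not spherical, and GLS is the appropriate estimation procedure.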
These conditions on the disturbances may be embodied in the set of assumptions

$$E(u_i u_i') = \sigma_{ii} I \qquad i = 1, \ldots, m$$

and

$$E(u_i u_j') = \sigma_{ij} I \qquad i \ne j; \ i, j = 1, \ldots, m$$

The first condition allows the disturbance variance to be different in the various
equations, but within each equation the assumptions of homoscedasticity and
zero covariances are still imposed. The second condition allows for nonzero
covariances between the disturbances in different equations and assumes that, for
any pair, the covariance is the same at each sample point; all lagged covariances,
however, are assumed to be zero. Collecting these variances and covariances in
the symmetric positive definite matrix

$$\Sigma = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\
\vdots & \vdots & & \vdots \\
\sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm}
\end{bmatrix}$$

the variance matrix for the $u$ vector in Eq. (8-85) may be written†

$$\operatorname{var}(u) = V = \Sigma \otimes I \tag{8-86}$$


Thus a set of demand equations should almost certainly be considered as a group
and estimated by GLS because of the nature of the variance matrix of the
disturbance term. In addition, theoretical considerations will impose restrictions
across equations. For example, in the addilog demand system above the second
parameter, $b_1$, in each $\beta_i$ vector is constrained to be equal across all $m$ equations.
This constraint has not been imposed in the specification (8-84). Implementation
of that system would allow a different estimate of the coefficient $b_1$ to be made for
each commodity. One may wish to test for constancy of $b_1$ across commodities
and to reestimate the system with constancy imposed.‡
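
GLS on the stacked system with $V = \Sigma \otimes I$ can be sketched as follows. This is a minimal illustration that treats $\Sigma$ as known (in practice it would have to be estimated) and imposes no cross-equation restrictions; the function name `sur_gls` and the simulated data are assumptions. As a check, the example uses identical regressors in both equations, a case in which GLS is known to reduce to equation-by-equation OLS.

```python
import numpy as np

def sur_gls(X_blocks, y_blocks, Sigma):
    """GLS for y = X beta + u with block-diagonal X and var(u) = Sigma (x) I."""
    m = len(X_blocks)
    n = y_blocks[0].shape[0]
    ks = [Xb.shape[1] for Xb in X_blocks]
    X = np.zeros((m * n, sum(ks)))          # block-diagonal X of Eq. (8-84)
    col = 0
    for i, Xb in enumerate(X_blocks):
        X[i * n:(i + 1) * n, col:col + ks[i]] = Xb
        col += ks[i]
    y = np.concatenate(y_blocks)
    # (Sigma (x) I)^{-1} = Sigma^{-1} (x) I
    V_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))
    XtVi = X.T @ V_inv
    return np.linalg.solve(XtVi @ X, XtVi @ y)

# Simulated two-equation system (illustrative values only)
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
X0 = np.column_stack([np.ones(n), x])       # same regressors in both equations
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
y1 = X0 @ np.array([1.0, 2.0]) + rng.normal(size=n)
y2 = X0 @ np.array([-1.0, 0.5]) + rng.normal(size=n)

beta_gls = sur_gls([X0, X0], [y1, y2], Sigma)
beta_ols = np.concatenate([
    np.linalg.lstsq(X0, y1, rcond=None)[0],
    np.linalg.lstsq(X0, y2, rcond=None)[0],
])
```

In the addilog system the regressors differ across equations (each $X_i$ has its own price column), so GLS and equation-by-equation OLS would then give different estimates.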

A second illustration of sets of equations with cross-equation restrictions and
connections between the various disturbances is found in sets of “share”
equations approximated by transcendental logarithmic functions, which has recently
become the dominant methodology in the estimation of various substitution
elasticities, especially in the field of energy economics.§ Consider a production
function

$$Q = f(X_1, X_2, \ldots, X_r)$$

where $Q$ denotes the rate of output and $X_i$ ($i = 1, \ldots, r$) the rate of input of the
$i$th productive factor. If one assumes the firm to face a given set of factor prices

† See Eqs. (4-76) and (4-77) for the definition of a Kronecker product and its inverse.

‡ For an illustration of the estimation and testing of three different demand systems see R. W.
Parks, “Systems of Demand Equations: An Empirical Comparison of Alternative Functional Forms,”
Econometrica, vol. 37, 1969, pp. 629-650.

§ See, for instance, E. A. Hudson and D. W. Jorgenson, “U.S. Energy Policy and Economic
Growth,” Bell Journal of Economics, vol. 5, 1974, pp. 461-514; E. R. Berndt and D. O. Wood,
“Technology, Prices and the Derived Demand for Energy,” Review of Economics and Statistics, vol.
57, 1975, pp. 259-268; and J. M. Griffin and P. R. Gregory, “An Intercountry Translog Model of
Energy Substitution Responses,” American Economic Review, vol. 66, 1976, pp. 845-857.
$P_1, \ldots, P_r$, one formulation of the firm’s decision problem is to choose the input
mix to minimize the cost of producing a given output $Q$. This gives rise to a set of
factor demand functions

$$X_i = f_i(P_1, \ldots, P_r, Q) \qquad i = 1, \ldots, r$$

Denoting the optimal inputs by $X_i^*$, the optimal (minimal) cost level is

$$C^* = \sum_i P_i X_i^* = f(P_1, \ldots, P_r, Q)$$

Differentiating $C^*$ with respect to the factor prices gives†

$$\frac{\partial C^*}{\partial P_i} = X_i^*$$

† This result is an application of Shephard’s lemma. (R. W. Shephard, Theory of Cost and
Production Functions, Princeton University Press, Princeton, NJ, 1970, p. 170.) The lemma may be
illustrated for a two-factor production function $Q = f(X_1, X_2)$. Suppose the firm is required to
produce some stated output $Q$ at minimum cost, given factor prices $P_1$ and $P_2$. If we define

$$\phi = (P_1X_1 + P_2X_2) - \lambda[f(X_1, X_2) - Q]$$

where $\lambda$ is a Lagrange multiplier, we then seek the minimum of $\phi$. The first-order conditions are

$$\begin{aligned}
\frac{\partial\phi}{\partial X_1} &= P_1 - \lambda f_1 = 0 \\
\frac{\partial\phi}{\partial X_2} &= P_2 - \lambda f_2 = 0 \\
\frac{\partial\phi}{\partial\lambda} &= -[f(X_1, X_2) - Q] = 0
\end{aligned} \tag{1}$$

The solution of these equations gives the cost-minimizing factor demands $X_1^*$ and $X_2^*$, expressed as
functions of $P_1$, $P_2$, and $Q$. The minimum achievable cost is then given by

$$C^* = P_1X_1^* + P_2X_2^* \tag{2}$$
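Differentiating Eq. (2) partially with respect to $P_1$ gives

$$\frac{\partial C^*}{\partial P_1} = X_1^* + P_1 \frac{\partial X_1^*}{\partial P_1} + P_2 \frac{\partial X_2^*}{\partial P_1}$$
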
and, similarly, that 


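Differentiate the system (1) totally, setting $dP_2 = dQ = 0$. This gives

$$\begin{aligned}
\lambda f_{11}\, dX_1 + \lambda f_{12}\, dX_2 + f_1\, d\lambda &= dP_1 \\
\lambda f_{12}\, dX_1 + \lambda f_{22}\, dX_2 + f_2\, d\lambda &= 0 \\
f_1\, dX_1 + f_2\, dX_2 &= 0
\end{aligned} \tag{3}$$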
Further

$$\frac{\partial \ln C^*}{\partial \ln P_i} = \frac{P_i}{C^*}\,\frac{\partial C^*}{\partial P_i} = \frac{P_i X_i^*}{C^*} = S_i$$


where S, denotes the cost share of the ith factor, that is, the proportion of total 
cost absorbed by the ith factor. Since C* depends on the factor prices and output, 
the cost shares will be functions of the same variables, that is, 


$$S_i = S_i(P_1, \ldots, P_n, Q) \qquad i = 1, \ldots, n$$


By estimating the parameters of the share equations one may be able to estimate
the parameters of the cost function. All depends on the functional form postulated
for the cost function.

A currently favored specification is the transcendental logarithmic (or translog)
function.† This is a very flexible form, capable of approximating a wide
variety of functional forms. As an illustration, the production function for the
industrial sector of an economy is often specified as

$$Q = f(K, L, E, M)$$

where the inputs distinguished are capital K, labor L, energy E, and materials M.
Assuming constant returns to scale plus exogenous factor prices $P_K$, $P_L$, $P_E$, and
$P_M$, and imposing symmetry on the second-order partial derivatives, gives the
translog cost function

$$\begin{aligned}
\ln C = {} & \alpha_0 + \ln Q + \alpha_K \ln P_K + \alpha_L \ln P_L + \alpha_E \ln P_E + \alpha_M \ln P_M \\
& + \tfrac{1}{2}\beta_{KK}(\ln P_K)^2 + \beta_{KL}(\ln P_K)(\ln P_L) + \beta_{KE}(\ln P_K)(\ln P_E) \\
& + \beta_{KM}(\ln P_K)(\ln P_M) + \tfrac{1}{2}\beta_{LL}(\ln P_L)^2 + \beta_{LE}(\ln P_L)(\ln P_E) \\
& + \beta_{LM}(\ln P_L)(\ln P_M) + \tfrac{1}{2}\beta_{EE}(\ln P_E)^2 + \beta_{EM}(\ln P_E)(\ln P_M) \\
& + \tfrac{1}{2}\beta_{MM}(\ln P_M)^2
\end{aligned}$$


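with the solution

$$dX_1 = -\frac{1}{\Delta} f_2^2\, dP_1 \qquad dX_2 = \frac{1}{\Delta} f_1 f_2\, dP_1$$

where $\Delta$ is the determinant of the $3 \times 3$ matrix of coefficients on the left-hand side of Eq. (3). Thus

$$P_1 \frac{\partial X_1^*}{\partial P_1} + P_2 \frac{\partial X_2^*}{\partial P_1} = \frac{1}{\Delta}\left(P_2 f_1 f_2 - P_1 f_2^2\right) = 0$$

from the first two equations in Eqs. (1).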


† See, for example, L. R. Christensen, D. W. Jorgenson, and L. J. Lau, “Transcendental
Logarithmic Production Frontiers,” Review of Economics and Statistics, vol. 55, 1973, pp. 28
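Differentiating $\ln C$ with respect to the logs of the prices gives the cost share
equations

$$\begin{aligned}
S_K &= \alpha_K + \beta_{KK} \ln P_K + \beta_{KL} \ln P_L + \beta_{KE} \ln P_E + \beta_{KM} \ln P_M \\
S_L &= \alpha_L + \beta_{KL} \ln P_K + \beta_{LL} \ln P_L + \beta_{LE} \ln P_E + \beta_{LM} \ln P_M \\
S_E &= \alpha_E + \beta_{KE} \ln P_K + \beta_{LE} \ln P_L + \beta_{EE} \ln P_E + \beta_{EM} \ln P_M \\
S_M &= \alpha_M + \beta_{KM} \ln P_K + \beta_{LM} \ln P_L + \beta_{EM} \ln P_E + \beta_{MM} \ln P_M
\end{aligned}$$

Since the shares must sum to unity,

$$\alpha_K + \alpha_L + \alpha_E + \alpha_M = 1$$

and the $\beta$’s sum to zero in each column (and row). Imposing the rowwise
constraints on the first three share equations gives the system

$$\begin{aligned}
S_K &= \alpha_K + \beta_{KK} \ln\left(\frac{P_K}{P_M}\right) + \beta_{KL} \ln\left(\frac{P_L}{P_M}\right) + \beta_{KE} \ln\left(\frac{P_E}{P_M}\right) \\
S_L &= \alpha_L + \beta_{KL} \ln\left(\frac{P_K}{P_M}\right) + \beta_{LL} \ln\left(\frac{P_L}{P_M}\right) + \beta_{LE} \ln\left(\frac{P_E}{P_M}\right) \\
S_E &= \alpha_E + \beta_{KE} \ln\left(\frac{P_K}{P_M}\right) + \beta_{LE} \ln\left(\frac{P_L}{P_M}\right) + \beta_{EE} \ln\left(\frac{P_E}{P_M}\right)
\end{aligned} \tag{8-87}$$

Because of the symmetry in the $\beta$’s there are just nine independent parameters in
this system. Estimation of these, in conjunction with the summation conditions on
the $\alpha$’s and $\beta$’s, will yield estimates of all the coefficients of the cost function
except $\alpha_0$.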

For the translog cost function the Allen partial elasticities of substitution are
given by

$$\sigma_{ij} = \frac{\beta_{ij} + S_i S_j}{S_i S_j} \qquad i \neq j$$

$$\sigma_{ii} = \frac{\beta_{ii} + S_i^2 - S_i}{S_i^2}$$

and the factor price elasticities by

$$\eta_{ij} = \sigma_{ij} S_j$$
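As a rough illustration of how these elasticities might be computed once the $\beta$’s and the shares are in hand, the following sketch evaluates $\sigma_{ij}$ and $\eta_{ij}$ with NumPy. The function name, the three-factor $\beta$ matrix, and the share values are hypothetical; they are chosen only so that the $\beta$ matrix is symmetric with zero row sums and the shares sum to unity.

```python
import numpy as np


def translog_elasticities(beta, shares):
    """Allen partial elasticities (sigma) and factor price elasticities (eta)."""
    beta = np.asarray(beta, dtype=float)
    s = np.asarray(shares, dtype=float)
    ss = np.outer(s, s)                       # S_i * S_j
    sigma = (beta + ss) / ss                  # formula for i != j
    own = (np.diag(beta) + s**2 - s) / s**2   # formula for i == j
    np.fill_diagonal(sigma, own)
    eta = sigma * s                           # eta_ij = sigma_ij * S_j
    return sigma, eta


# Hypothetical symmetric beta matrix with zero row sums, and shares summing to one.
beta = [[0.05, -0.02, -0.03],
        [-0.02, 0.06, -0.04],
        [-0.03, -0.04, 0.07]]
shares = [0.2, 0.3, 0.5]
sigma, eta = translog_elasticities(beta, shares)
```

With the summation conditions satisfied, each row of `eta` sums to zero, which is a convenient check on the arithmetic.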


Since the four shares sum identically to unity, one must expect nonzero contem- 
poraneous covariances between disturbances in different equations, and there is 
also no a priori reason to expect the same disturbance variance in different share 
equations. However, this system differs in one major aspect from the addilog 
demand functions in Eqs. (8-82). In Eqs. (8-87) the same set of explanatory 
variables appears in each share equation, but that is not true in Eqs. (8-82), and 
we will return to the significance of this point below. At the next level of 
disaggregation a production function could be specified for the energy sector with 
various specific fuels as inputs and the parameters estimated from a set of energy 
cost share equations. There have been many applications of this cost share 
approach in recent years. However, a word of caution is required. As the
derivation made clear, a basic assumption underlying the share
equations is that in each observation period in the sample there has been a full
and complete adjustment of the input mix to the factor prices ruling in that
period, so that the minimum cost level $C^*$ is achieved. This is an implausible
assumption for many production processes, and actual cost shares probably
represent various lagged adjustments to changing factor prices. The assumption of
instantaneous adjustment is likely to produce seriously biased estimates of the
various elasticities.


Feasible GLS Estimation 


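Returning now to the general set of equations set out in Eqs. (8-83) to (8-86), the
GLS estimator of $\beta$ is

$$b_* = (X'V^{-1}X)^{-1}X'V^{-1}y$$

From Eqs. (8-86)

$$V^{-1} = \Sigma^{-1} \otimes I = \begin{bmatrix} \sigma^{11}I & \cdots & \sigma^{1m}I \\ \vdots & & \vdots \\ \sigma^{m1}I & \cdots & \sigma^{mm}I \end{bmatrix}$$

where $\sigma^{ij}$ denotes the $i, j$th element in $\Sigma^{-1}$. Substituting for $V^{-1}$ in the formula
for $b_*$ gives

$$b_* = \begin{bmatrix} \sigma^{11}X_1'X_1 & \cdots & \sigma^{1m}X_1'X_m \\ \vdots & & \vdots \\ \sigma^{m1}X_m'X_1 & \cdots & \sigma^{mm}X_m'X_m \end{bmatrix}^{-1} \begin{bmatrix} \sum_{j=1}^{m} \sigma^{1j}X_1'y_j \\ \vdots \\ \sum_{j=1}^{m} \sigma^{mj}X_m'y_j \end{bmatrix} \tag{8-88}$$

and the associated variance matrix is

$$\operatorname{var}(b_*) = (X'V^{-1}X)^{-1} \tag{8-89}$$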


The obvious operational difficulty with Eq. (8-88) is that $\Sigma$ is unknown.
Zellner has proposed the construction of a feasible estimator as follows.†


1. Apply OLS separately to each equation in Eqs. (8-83), obtaining the vectors
of sample residuals $e_1, e_2, \ldots, e_m$, where

$$e_i = [I - X_i(X_i'X_i)^{-1}X_i']\,y_i \qquad i = 1, \ldots, m$$

2. The diagonal elements $\sigma_{ii}$ of $\Sigma$ are estimated by

$$s_{ii} = \frac{e_i'e_i}{n - k_i}$$


† A. Zellner, “An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for
Aggregation Bias,” Journal of the American Statistical Association, vol. 57, 1962, pp. 348-368.
and the off-diagonal elements $\sigma_{ij}$ by

$$s_{ij} = \frac{e_i'e_j}{(n - k_i)^{1/2}(n - k_j)^{1/2}}$$

where $k_i$ denotes the number of columns in $X_i$.† The denominator in these
estimates may alternatively be taken simply as $n$ since the usual test procedures
will now only be valid asymptotically. Thus an estimated $\Sigma$ matrix is
computed and substituted in Eq. (8-88) to give a feasible estimator. The usual
significance tests based on an estimated version of $\operatorname{var}(b_*)$ now have an
asymptotic justification rather than small sample validity.
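The two steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a production routine: the function name `sure_fgls`, the simulated two-equation system, and every numerical value are hypothetical, and the full $mn \times mn$ matrix $V^{-1}$ is formed explicitly, which is only sensible for small problems.

```python
import numpy as np


def sure_fgls(X_list, y_list):
    """Two-step feasible GLS for m equations with n observations each.

    Returns the stacked coefficient vector, its estimated variance
    matrix, and the estimated disturbance covariance matrix S.
    """
    m, n = len(X_list), len(y_list[0])
    # Step 1: OLS residuals, equation by equation (column i holds e_i).
    E = np.column_stack([
        y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        for X, y in zip(X_list, y_list)
    ])
    # Step 2: s_ij = e_i'e_j / sqrt((n - k_i)(n - k_j)).
    dof = np.array([n - X.shape[1] for X in X_list], dtype=float)
    S = (E.T @ E) / np.sqrt(np.outer(dof, dof))
    # Stack the system: block-diagonal X, stacked y, V^{-1} = S^{-1} kron I.
    K = sum(X.shape[1] for X in X_list)
    X_big = np.zeros((m * n, K))
    col = 0
    for i, X in enumerate(X_list):
        X_big[i * n:(i + 1) * n, col:col + X.shape[1]] = X
        col += X.shape[1]
    y_big = np.concatenate(y_list)
    V_inv = np.kron(np.linalg.inv(S), np.eye(n))
    var_b = np.linalg.inv(X_big.T @ V_inv @ X_big)
    b_star = var_b @ X_big.T @ V_inv @ y_big
    return b_star, var_b, S


# Simulated two-equation system (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 2.0]], size=n)
y1 = X1 @ np.array([1.0, 2.0]) + u[:, 0]
y2 = X2 @ np.array([-1.0, 0.5, 1.5]) + u[:, 1]
b_star, var_b, S = sure_fgls([X1, X2], [y1, y2])
```

The square roots of the diagonal of `var_b` would serve as asymptotic standard errors in this setting.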


This estimator is often referred to as the SURE (seemingly unrelated regression
equations) estimator, after the title of Zellner’s original paper. This title is
something of a misnomer, since the most natural application of the technique is to
sets of equations which are indeed theoretically related, as in the two examples.
The gain in efficiency yielded by the Zellner estimator over OLS increases directly
with the correlation between disturbances from the different equations and
decreases as the correlation between the different sets of explanatory variables
increases. Indeed the GLS estimator reduces to OLS if either (1) the $\sigma_{ij}$ ($i \neq j$) are
all zero or (2) the $X_i$ are identical.‡ Even if the true correlation between equation
disturbances is zero, the sample OLS residuals may yield nonnegligible covariances,
and one might mistakenly compute GLS estimates. The result will be
estimates with somewhat greater standard errors than those of the OLS coefficients.
This will be true even for very small disturbance correlations, but as these
correlations increase, the efficiency of the GLS over the OLS estimates rises
substantially.§
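Case (2) is easy to confirm numerically. The sketch below builds a three-equation system with an identical regressor matrix in every equation and a nondiagonal $\Sigma$, then compares GLS on the stacked system with equation-by-equation OLS. The data are simulated and all values, including the assumed $\Sigma$, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # same X in every equation
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 2.0, 0.5],
                  [0.3, 0.5, 1.5]])                           # assumed disturbance covariances
U = rng.multivariate_normal(np.zeros(m), Sigma, size=n)
B_true = rng.normal(size=(3, m))
Y = X @ B_true + U                                            # column i is y_i

# Equation-by-equation OLS, stacked as (b_1, b_2, b_3).
b_ols = np.linalg.solve(X.T @ X, X.T @ Y).T.ravel()

# GLS on the stacked system with V = Sigma kron I.
X_big = np.kron(np.eye(m), X)
y_big = Y.T.ravel()
V_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))
b_gls = np.linalg.solve(X_big.T @ V_inv @ X_big, X_big.T @ V_inv @ y_big)
```

The two coefficient vectors coincide up to rounding error, whatever positive definite $\Sigma$ is used.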


Tests of Linear Restrictions 


In order to see how to test a set of linear restrictions in the SURE model we must
extend the test developed in Chap. 5 under the OLS assumptions to fit the new
GLS assumptions. As we have seen in Sec. 8-3, the GLS estimator may be
obtained by applying OLS to the transformed equation

$$Ty = (TX)\beta + Tu$$

where

$$T'T = V^{-1}$$

Making the appropriate substitution of $Ty$ for $y$ and $TX$ for $X$ in the OLS test


† In the two illustrative examples the $X_i$ matrices had an equal number of columns, but there is no
need to impose such a condition generally. The exposition also assumes an equal sample size in each
regression, but this is merely a simplification and need not be imposed generally.

‡ See Problem 8-2.

§ J. Kmenta and R. F. Gilbert, “Small Sample Properties of Alternative Estimators of Seemingly
Unrelated Regressions,” Journal of the American Statistical Association, vol. 63, 1968, pp. 1180-1200.
statistic for $H_0\colon R\beta = r$, now gives the statistic

$$\frac{(r - Rb_*)'[R(X'V^{-1}X)^{-1}R']^{-1}(r - Rb_*)/q}{e'V^{-1}e/(n - k)} \tag{8-90}$$

where $b_*$ is the GLS estimator, $q$ is the number of restrictions embodied in the
null hypothesis, and $e = y - Xb_*$. Under the null hypothesis this statistic follows
the $F(q, n - k)$ distribution.

The SURE model specified in Eqs. (8-84) to (8-86) gives a special case of this
statistic. There are $m$ separate equations with $n$ observations on each, giving
$N = mn$ observations in all. There are $k_i$ variables in $X_i$, and the estimation of the
unrestricted model, Eq. (8-84), will thus yield estimates of $K = \sum_{i=1}^{m} k_i$ parameters.
Finally the $V$ matrix has the special form shown in Eq. (8-86). Thus the test
statistic becomes

$$\frac{(r - Rb_*)'\{R[X'(\Sigma^{-1} \otimes I)X]^{-1}R'\}^{-1}(r - Rb_*)/q}{e'(\Sigma^{-1} \otimes I)e/(N - K)} \tag{8-91}$$

Finally the unknown $\Sigma$ in Eq. (8-91) has to be replaced by $\hat{\Sigma}$, containing the $s_{ij}$
defined above, and the test now has only an asymptotic justification.

As an illustration of the construction of the $R$ matrix consider an addilog
system, Eqs. (8-82), with just four commodity groups and hence three estimated
equations. Application of the SURE technique to Eq. (8-84) will give the GLS
vector $b_*$, containing nine estimated parameters, namely,

$$b_* = [a_1 \;\; b_1^{(1)} \;\; b_2 \;\; a_2 \;\; b_1^{(2)} \;\; b_3 \;\; a_3 \;\; b_1^{(3)} \;\; b_4]'$$

Thus each of the equations gives an estimate of the $b_1$ parameter, as indicated by
the superscript. The null hypothesis is that the true value of this parameter is the
same in all three equations. This translates into a two-element constraint, namely,

$$b_1^{(1)} - b_1^{(2)} = 0$$

$$b_1^{(1)} - b_1^{(3)} = 0$$

Thus

$$R = \begin{bmatrix} 0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \end{bmatrix}$$

and

$$r = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$


If the null hypothesis is not rejected and one wishes to reestimate the system
with the constraint imposed, one may take the formula for the restricted OLS
estimator, given in Eq. (6-5), and replace $X$ by $TX$ and $y$ by $Ty$ to obtain

$$b_{*r} = b_* + (X'V^{-1}X)^{-1}R'[R(X'V^{-1}X)^{-1}R']^{-1}(r - Rb_*) \tag{8-92}$$
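The test statistic of Eq. (8-90) and the restricted estimator of Eq. (8-92) share most of their ingredients, so they can be computed together. The sketch below assumes $V^{-1}$ is known; the function name, the simulated heteroscedastic data, and the particular restriction are hypothetical illustrations rather than anything taken from the text.

```python
import numpy as np


def gls_restriction_test(X, y, V_inv, R, r):
    """Return (F statistic of Eq. (8-90), unrestricted b, restricted b)."""
    n, k = X.shape
    q = R.shape[0]
    A = np.linalg.inv(X.T @ V_inv @ X)            # (X'V^{-1}X)^{-1}
    b = A @ X.T @ V_inv @ y                       # GLS estimator
    e = y - X @ b
    d = r - R @ b
    M = np.linalg.inv(R @ A @ R.T)
    F = (d @ M @ d / q) / (e @ V_inv @ e / (n - k))
    b_r = b + A @ R.T @ M @ d                     # Eq. (8-92)
    return F, b, b_r


# Illustrative data: heteroscedastic disturbances with known variances.
rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w = rng.uniform(0.5, 2.0, size=n)                 # assumed disturbance variances
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n) * np.sqrt(w)
V_inv = np.diag(1.0 / w)
R = np.array([[0.0, 1.0, -1.0]])                  # H0: the two slope coefficients are equal
r = np.array([0.0])
F, b, b_r = gls_restriction_test(X, y, V_inv, R, r)
```

By construction the restricted estimate satisfies $Rb_{*r} = r$ exactly, which gives a quick check that the formula has been coded correctly.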


Equivalently one may arrange the columns of the data matrix in such a way that a
direct application of GLS gives an estimator obeying the constraints. For a
three-equation version of Eqs. (8-82) the arrangement would be

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} i & x_2 & 0 & 0 & 0 & 0 & x_1 \\ 0 & 0 & i & x_3 & 0 & 0 & x_1 \\ 0 & 0 & 0 & 0 & i & x_4 & x_1 \end{bmatrix} \begin{bmatrix} a_1 \\ b_2 \\ a_2 \\ b_3 \\ a_3 \\ b_4 \\ b_1 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} \tag{8-93}$$

where $i$ denotes a column vector of $n$ units and $x_1$ the $n$ observations on
$\ln(M/P_1)$, and so on. Repeating the $x_1$ vector in a single column of the data
matrix ensures just a single estimate of its coefficient. The efficient estimation
procedure for Eq. (8-93) is still the Zellner SURE technique since the disturbance
variance matrix has the form of Eq. (8-86). There is a moot point whether the
estimates of the elements of $\Sigma$ in the first stage of the technique should be
obtained from the application of OLS to each equation separately, as previously
described, or from the application of OLS to Eq. (8-93), which incorporates the
restrictions implied by the null hypothesis. The two procedures are equivalent
asymptotically.

Looking now at the cost share equations (8-87), unrestricted estimation of the
equations would be achieved by OLS, even with a nonspherical disturbance
matrix, since the matrix of explanatory variables is identical in each equation.
However, the test of the symmetry restrictions still requires the computation of
the test statistic (8-91): the GLS $b_*$ in that formula is now replaced by the OLS $b$,
but the elements of the $\Sigma$ matrix must be estimated from the OLS residuals as
before. The specification of the appropriate $R$ matrix and $r$ vector for the test of
the symmetry conditions is left as an exercise for the reader.† Equations (8-87)
already embody summation restrictions on both $\alpha$’s and $\beta$’s. One may wish to
test these restrictions before looking at Eqs. (8-87), or one may wish to test the
complete set of summation and symmetry conditions.‡ Finally if one wishes to
estimate Eqs. (8-87) with the symmetry restrictions imposed, the Zellner SURE
technique is again required, as in the addilog example. The details are left as an
exercise.

The choice of which of the four share equations to drop in obtaining the set 
of three equations (8-87) is an arbitrary one. The SURE estimates are not 
invariant to the choice of equation to drop. However, iteration of the SURE 
technique will produce parameter estimates that converge to the ML parameter 
estimates, which are unique and independent of the equation omitted.§ 


† See Problem 8-3.

‡ See Problem 8-4.

§ See the very useful discussion and references to other relevant papers in E. R. Berndt and L. R.
Christensen, “The Translog Function and the Substitution of Equipment, Structures and Labor in
U.S. Manufacturing, 1929-68,” Journal of Econometrics, vol. 1, 1973, pp. 81-113.
The iterative process goes as follows.


1. Compute the $s_{ij}$ from the OLS residuals, as described above, and hence
obtain $\hat{\Sigma}$.

2. Compute the elements of $\hat{\Sigma}^{-1}$ and substitute in Eq. (8-88) to compute $b_*$.

3. Using $b_*$ compute a new set of residuals $e_* = y - Xb_*$.

4. Partition $e_*$ into the subvectors corresponding to each equation and use these
subvectors to compute new $s_{ij}$, thus starting the process over again.
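The four steps above can be sketched as a simple loop. The version below is an illustration under stated assumptions: the function name, the convergence tolerance, and the simulated data are hypothetical, and the divisor $n$ is used for every $s_{ij}$ (the asymptotically equivalent alternative mentioned earlier) rather than the degrees-of-freedom divisor.

```python
import numpy as np


def iterated_sure(X_big, y_big, m, n, tol=1e-8, max_iter=500):
    """Iterate the SURE steps until the coefficient vector stops changing.

    X_big is the stacked (m*n, K) block-diagonal regressor matrix and
    y_big the stacked dependent variable, equation by equation.
    """
    # With a block-diagonal X_big, pooled OLS equals equation-by-equation OLS.
    b = np.linalg.lstsq(X_big, y_big, rcond=None)[0]
    for it in range(1, max_iter + 1):
        E = (y_big - X_big @ b).reshape(m, n).T     # column i = residuals of equation i
        S = E.T @ E / n                             # step 1 (or 4): new s_ij
        V_inv = np.kron(np.linalg.inv(S), np.eye(n))
        XtVi = X_big.T @ V_inv
        b_new = np.linalg.solve(XtVi @ X_big, XtVi @ y_big)   # step 2
        if np.max(np.abs(b_new - b)) < tol:
            return b_new, S, it
        b = b_new                                   # step 3 uses the updated b
    return b, S, max_iter


# Simulated two-equation system (illustrative values only).
rng = np.random.default_rng(4)
m, n = 2, 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X_big = np.zeros((m * n, 4))
X_big[:n, :2] = X1
X_big[n:, 2:] = X2
u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.5]], size=n)
y_big = np.concatenate([X1 @ np.array([1.0, 2.0]) + u[:, 0],
                        X2 @ np.array([0.5, -1.0]) + u[:, 1]])
b_iter, S_iter, n_iter = iterated_sure(X_big, y_big, m, n)
```

In applications to share systems the same loop would be run on the retained equations only.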


The SURE process may be further complicated to allow for autocorrelation
in the disturbance terms, but we will not pursue that topic here.†


PROBLEMS 


8-1 Derive the results on the efficiency of the OLS estimator under the two forms of heteroscedasticity
given in Eqs. (8-34a) and (8-34b).
8-2 Prove that the SURE estimator in Eq. (8-88) reduces to the application of OLS to each equation
separately if

(a) $\sigma_{ij} = 0$ for all $i \neq j$
or

(b) $X_1 = X_2 = \cdots = X_m$
8-3 Specify the $R$ matrix and $r$ vector for testing the symmetry conditions in the set of equations
(8-87).
8-4 Consider the four cost share equations prior to Eqs. (8-87) and explain how to test the full set of
summation restrictions (on $\alpha$’s and $\beta$’s) and symmetry conditions. Which, if any, of these restrictions
might be satisfied exactly by the estimated coefficients?
8-5 Consider a heteroscedastic model (for which all other classical assumptions hold) 


Yj, mat BX, + wy i= 1,2,..-.m (m> 1) 


fengape..e, (ny >'2) 


Suppose var(u,) = 07. A sample estimator of 0, is 


where 


Determine E(s?). 
Cea) (University of Michigan, 1981) 


† See R. W. Parks, “Efficient Estimation of a System of Regression Equations when Disturbances
Are Both Serially and Contemporaneously Correlated,” Journal of the American Statistical Association,
vol. 62, 1967, pp. 500-509, for a treatment of first-order serial correlation; and see G. G. Judge, W. E.
Griffiths, R. C. Hill, and T. C. Lee, The Theory and Practice of Econometrics, Wiley, New York, 1980,
Chap. 6, for more general cases.
8-6 In the model

$$y_{1t} = \alpha x_{1t} + u_{1t}$$

$$y_{2t} = \beta x_{2t} + u_{2t}$$

the $x_{it}$ are nonrandom exogenous variables and the $u_{it}$ are serially independent random disturbances
that are normally distributed with zero means and second moments

$$E(u_{1t}^2) = 1 \qquad E(u_{2t}^2) = 2 \qquad E(u_{1t}u_{2t}) = 1$$


for all values of ¢. The sample second moment matrix below was calculated from 20 sample 
observations: 


Wi Y2 M1 


yf] W-1 1-1 
Ful lost 
x H Sellyt 


(es ee 
(a) Find the best linear unbiased estimates of the parameters $\alpha$ and $\beta$.
(b) Test the null hypothesis
$H_0\colon \alpha = \beta$
against the alternative $H_1\colon \alpha \neq \beta$.
(University of Michigan, 1980) 


CHAPTER 


NINE 
LAGGED VARIABLES 


We will use the term “lagged variables” to cover the inclusion on the right-hand
side of the regression equation of lagged values of the explanatory variables, the
X’s, and/or lagged values of the dependent variable Y.


9-1 SOURCES OF LAGGED VARIABLES 


Realistic formulations of economic relations often require the insertion of lagged
values of the explanatory variables. For instance, a rise in “permanent” income is
likely to have an effect on consumption, which is distributed over a number of
time periods, or a change in investment allowances may be expected to result in
changed investment allocations, which, in turn, will have an effect on actual
investment spending spread over a number of time periods because of production
and other lags.

In general let us suppose that a causal variable $X_t$ exerts a distributed lag
effect on $Y$ as follows:


Period         t                 t + 1             t + 2             t + 3
Effect on Y    $\delta_0 X_t$    $\delta_1 X_t$    $\delta_2 X_t$    $\delta_3 X_t$
Assuming the lag pattern to persist through time, any $Y_t$ is seen to be built up as
the sum of effects from current and previous values of $X$. Thus the lagged effect
assumed above would generate the relation

$$Y_t = \mu + \delta_0 X_t + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \delta_3 X_{t-3} + u_t$$

where we have also allowed for an intercept and a disturbance term. In practice
one does not usually have any strong a priori information about the maximum
length of lag, and one formulates the general relation

$$Y_t = \mu + D(L)X_t + u_t \tag{9-1}$$

where $D(L)$ is a polynomial of some degree $s$ in the lag operator, that is,

$$D(L) = \delta_0 + \delta_1 L + \cdots + \delta_s L^s \tag{9-2}$$

If $X$ has remained constant at some level $\bar{X}$ for $s$ periods, then, apart from
disturbances, $Y$ will have reached an equilibrium value

$$\bar{Y} = \mu + D(1)\bar{X}$$

where $D(1)$ indicates the value of the polynomial when $L$ is replaced by unity, and
is simply the sum of the $\delta_i$ coefficients, namely,

$$D(1) = \sum_{i=0}^{s} \delta_i$$

If $X$ changes in period $t$ by an amount $\Delta X_t$ and is then held constant at the new
level, $Y$ will gradually adjust from $\bar{Y}$ to a new equilibrium. The changes are

Period         t                        t + 1                    t + 2
Change in Y    $\delta_0 \Delta X_t$    $\delta_1 \Delta X_t$    $\delta_2 \Delta X_t$

The coefficient $\delta_0$ ($= \Delta Y_t/\Delta X_t$) thus represents the impact multiplier for $X$.
Partial sums of the $\delta$’s indicate intermediate multipliers. The $\delta$’s may also be
standardized by dividing by their sum $D(1)$. Partial sums of the standardized $\delta$’s
then indicate the proportion of the total effect achieved by a certain period. For
example, knowledge of the $\delta$’s enables one to estimate how many periods must
elapse before, say, 90 percent of the total effect is achieved. An important concept
is that of the median lag, which is the number of periods required for 50 percent
of the total effect to be achieved. When all the $\delta$’s are positive, another useful
statistic is the mean lag defined as

$$\text{Mean lag} = \frac{\sum_{i=0}^{s} i\delta_i}{\sum_{i=0}^{s} \delta_i} = \frac{\delta_1 + 2\delta_2 + \cdots + s\delta_s}{\delta_0 + \delta_1 + \delta_2 + \cdots + \delta_s}$$

From Eq. (9-2) it is seen that differentiating $D(L)$ with respect to $L$ gives

$$D'(L) = \delta_1 + 2\delta_2 L + \cdots + s\delta_s L^{s-1}$$

Thus

$$\text{Mean lag} = \frac{D'(1)}{D(1)}$$

As an illustration suppose an estimated version of Eq. (9-1) yields

$$D(L) = 0.10 + 0.25L + 0.35L^2 + 0.15L^3 + 0.05L^4$$
Then

$$D(1) = 0.90$$

so that the final or total effect of a unit change in $X$ is a change of 0.9 in $Y$. Also

$$D'(1) = 0.25 + 0.70 + 0.45 + 0.20 = 1.60$$

and the mean lag is computed as 1.6/0.9 = 1.78 periods. The standardized
coefficients and their cumulated values are as follows:


Period                       0       1       2       3       4
Standardized coefficients    0.11    0.28    0.39    0.17    0.05
Cumulated values             0.11    0.39    0.78    0.95    1.00
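The arithmetic of this illustration, together with an interpolated median lag, can be reproduced in a few lines. The sketch below is only a convenience for checking the figures; the variable names are arbitrary, and the interpolation rule (linear between the two periods whose cumulated shares bracket one-half) is an assumption about how the median is located.

```python
import numpy as np

delta = np.array([0.10, 0.25, 0.35, 0.15, 0.05])   # coefficients of D(L) in the illustration
lags = np.arange(len(delta))

total = delta.sum()                  # D(1)
d_prime = (lags * delta).sum()       # D'(1)
mean_lag = d_prime / total

standardized = delta / total
cumulated = np.cumsum(standardized)

# Linear interpolation between the two periods whose cumulated shares bracket 0.5.
j = int(np.searchsorted(cumulated, 0.5))   # first period with cumulated share >= 0.5
median_lag = (j - 1) + (0.5 - cumulated[j - 1]) / (cumulated[j] - cumulated[j - 1])
```

Because the code works with unrounded shares, its interpolated median lag can differ in the second decimal place from a figure obtained from the rounded table entries.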


The median lag would be computed by interpolation as

$$1 + \frac{0.50 - 0.39}{0.78 - 0.39} = 1.28 \text{ periods}$$

In practice the maximum lag $s$ may have to be fairly large to provide an
adequate representation of the relationship between $Y$ and $X$. It is frequently
possible to achieve a more parsimonious representation (that is, using a smaller
number of parameters) by postulating a distributed lag on both $Y_t$ and $X_t$, as in

$$A(L)(Y_t - \mu) = B(L)X_t + v_t \tag{9-3}$$

where

$$A(L) = 1 - \alpha_1 L - \alpha_2 L^2 - \cdots - \alpha_p L^p \tag{9-4a}$$

$$B(L) = \beta_0 + \beta_1 L + \cdots + \beta_q L^q \tag{9-4b}$$


and it is expected that $p + q$ will be less than $s$. The stability of Eq. (9-3) imposes
conditions on the $\alpha$’s, which may be expressed in the form that the roots of $A(L)$
lie outside the unit circle.† Relation (9-3) may be rewritten as

$$Y_t = \mu + \frac{B(L)}{A(L)}X_t + u_t \tag{9-5}$$

where we are assuming the disturbances to be related by $v_t = A(L)u_t$. Comparison
of Eqs. (9-1) and (9-5) gives

$$\frac{B(L)}{A(L)} = D(L) \tag{9-6}$$

As an example, suppose that

$$A(L) = 1 - \alpha_1 L \tag{9-7a}$$

and

$$B(L) = \beta_0 + \beta_1 L \tag{9-7b}$$

Then‡

$$\frac{B(L)}{A(L)} = \beta_0 + (\alpha_1\beta_0 + \beta_1)L + \alpha_1(\alpha_1\beta_0 + \beta_1)L^2 + \alpha_1^2(\alpha_1\beta_0 + \beta_1)L^3 + \cdots$$


† See G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, revised
edition, Holden-Day, San Francisco, 1976, pp. 53-54.
‡ Note that $(1 - \alpha_1 L)^{-1} = 1 + \alpha_1 L + \alpha_1^2 L^2 + \cdots$.
and the correspondence between the $\alpha$’s, $\beta$’s, and $\delta$’s is

$$\begin{aligned}
\delta_0 &= \beta_0 \\
\delta_1 &= \alpha_1\beta_0 + \beta_1 \\
\delta_2 &= \alpha_1\delta_1 \\
\delta_3 &= \alpha_1\delta_2
\end{aligned} \tag{9-8}$$

and so on. Two first-order polynomials $A(L)$ and $B(L)$ thus generate an infinite
polynomial for $D(L)$, but there are, of course, implied restrictions on the $\delta$’s. As
shown by Eq. (9-8), the first two $\delta$’s are “free” and subsequent $\delta$’s decline
exponentially. They decline since the stability requirement that the root of

$$A(L) = 1 - \alpha_1 L = 0$$

lie outside the unit circle ensures that $\alpha_1$ ($= 1/L$) has modulus less than unity.
The same condition ensures that the infinite sum of the $\delta$’s converges. From Eq.
(9-8) this sum is seen to be

$$D(1) = \beta_0 + \frac{\alpha_1\beta_0 + \beta_1}{1 - \alpha_1} = \frac{\beta_0 + \beta_1}{1 - \alpha_1} = \frac{B(1)}{A(1)}$$

Clearly, extending the power of the $B(L)$ polynomial would extend the number
of “free” $\delta$ coefficients before the exponential decline sets in. The mean lag may
also be derived from the $A(L)$, $B(L)$ polynomials. Since $D(L) = B(L)/A(L)$,

$$\frac{D'(L)}{D(L)} = \frac{B'(L)}{B(L)} - \frac{A'(L)}{A(L)}$$

and so

$$\text{Mean lag} = \frac{B'(1)}{B(1)} - \frac{A'(1)}{A(1)} \tag{9-9}$$

This expression is often easier to compute than the equivalent expression in terms
of the $\delta$’s. For the above example

$$\text{Mean lag} = \frac{\beta_1}{\beta_0 + \beta_1} + \frac{\alpha_1}{1 - \alpha_1} = \frac{\alpha_1\beta_0 + \beta_1}{(1 - \alpha_1)(\beta_0 + \beta_1)}$$
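The relations among the $\alpha$’s, $\beta$’s, and $\delta$’s can be checked by expanding $B(L)/A(L)$ numerically. The sketch below uses the recursion implied by $A(L)D(L) = B(L)$; the function name and the parameter values are hypothetical, and the infinite polynomial is truncated at 200 terms, which is ample when $\alpha_1$ is well inside the unit interval.

```python
import numpy as np


def rational_lag_weights(alpha, beta, n_terms):
    """First n_terms coefficients delta_i of D(L) = B(L)/A(L).

    alpha holds (alpha_1, ..., alpha_p) from A(L) = 1 - alpha_1 L - ... - alpha_p L^p,
    beta holds (beta_0, ..., beta_q) from B(L).
    """
    delta = np.zeros(n_terms)
    for i in range(n_terms):
        b_i = beta[i] if i < len(beta) else 0.0
        # A(L) D(L) = B(L) implies delta_i = beta_i + sum_j alpha_j * delta_{i-j}.
        ar = sum(alpha[j - 1] * delta[i - j] for j in range(1, min(i, len(alpha)) + 1))
        delta[i] = b_i + ar
    return delta


alpha1, beta0, beta1 = 0.6, 0.2, 0.3               # hypothetical parameter values
n_terms = 200
delta = rational_lag_weights([alpha1], [beta0, beta1], n_terms)
total = delta.sum()                                 # approximates D(1)
mean_lag = (np.arange(n_terms) * delta).sum() / total

# Closed forms from the text for comparison.
total_closed = (beta0 + beta1) / (1 - alpha1)
mean_lag_closed = beta1 / (beta0 + beta1) + alpha1 / (1 - alpha1)
```

For these values the truncated sums agree with the closed forms to many decimal places, and the same function handles higher-degree $A(L)$ and $B(L)$.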


The Koyck Scheme 


We have seen that, starting with an equation like Eq. (9-1), which involves only
lagged $X$ values on the right-hand side, one may be led by considerations of
parsimonious parameterization to reformulate it as in Eq. (9-3), which introduces
lagged values of the dependent variable $Y$ among the regressors. When the $A(L)$
polynomial is just of the first degree, as in Eq. (9-7a), we have an example of a
Koyck scheme of declining exponential weights.† The simple Koyck scheme has
the coefficients on the $X$’s declining exponentially from the start, that is,

$$\delta_i = \alpha_1\delta_{i-1} \qquad i = 1, 2, \ldots$$

This corresponds to the specification

$$A(L) = 1 - \alpha_1 L \qquad \text{and} \qquad B(L) = \beta_0$$

and the relationship may be formulated as

$$Y_t = \mu + \delta_0 X_t + \alpha_1\delta_0 X_{t-1} + \alpha_1^2\delta_0 X_{t-2} + \cdots + u_t \tag{9-10}$$

or, equivalently, as

$$(1 - \alpha_1 L)(Y_t - \mu) = \beta_0 X_t + v_t$$

which may be written

$$Y_t = \mu(1 - \alpha_1) + \alpha_1 Y_{t-1} + \beta_0 X_t + v_t \tag{9-11}$$

Equivalence between Eqs. (9-10) and (9-11) requires

$$\beta_0 = \delta_0$$

and

$$v_t = (1 - \alpha_1 L)u_t = u_t - \alpha_1 u_{t-1} \tag{9-12}$$

Thus if the original disturbances $\{u_t\}$ in Eq. (9-10) are serially independent, the
transformed disturbances $\{v_t\}$ in Eq. (9-11) are serially dependent, which has
implications for the estimation procedures to be considered in Sec. 9-2. Since
$A(1) = 1 - \alpha_1$, $A'(1) = -\alpha_1$, $B(1) = \beta_0$, and $B'(1) = 0$, the mean lag for the
simple Koyck process is $\alpha_1/(1 - \alpha_1)$. As has already been indicated, raising
the degree of the $B(L)$ polynomial, while retaining $A(L) = 1 - \alpha_1 L$, increases
the number of “free” coefficients before the Koyck exponential decline comes
into play.
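The equivalence of the two Koyck formulations can be seen on simulated data. In the sketch below the series, the parameter values, and the treatment of pre-sample values of $X$ as zero are all assumptions made for illustration; with that start-up convention the distributed lag form and the lagged dependent variable form generate the same $Y$ series.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 300
mu, alpha1, delta0 = 1.0, 0.7, 0.5        # hypothetical parameter values
x = rng.normal(size=T)
u = rng.normal(scale=0.1, size=T)

# Form (9-10): geometric weights on current and past X (pre-sample X taken as zero).
y_dl = np.empty(T)
for t in range(T):
    w = delta0 * alpha1 ** np.arange(t + 1)        # weights on x_t, x_{t-1}, ..., x_0
    y_dl[t] = mu + w @ x[t::-1] + u[t]

# Form (9-11): lagged Y with the MA(1) disturbance v_t = u_t - alpha1 * u_{t-1}.
y_ar = np.empty(T)
y_ar[0] = y_dl[0]
for t in range(1, T):
    v = u[t] - alpha1 * u[t - 1]
    y_ar[t] = mu * (1 - alpha1) + alpha1 * y_ar[t - 1] + delta0 * x[t] + v
```

The second loop makes the serial dependence of the transformed disturbance explicit: each $v_t$ shares the term $u_{t-1}$ with its predecessor.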

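So far we have considered the distributed lag effect of just a single explanatory
variable. Suppose there are two explanatory variables, each with a Koyck lag.
There may be no a priori reason to expect an identical decay parameter in each
lag. Thus the relation might be formulated as

$$\begin{aligned} Y_t = {} & \mu + \beta X_t + \alpha_1\beta X_{t-1} + \alpha_1^2\beta X_{t-2} + \cdots + \gamma Z_t \\ & + \alpha_2\gamma Z_{t-1} + \alpha_2^2\gamma Z_{t-2} + \cdots + u_t \end{aligned} \tag{9-13}$$

or

$$Y_t = \mu + \frac{\beta}{1 - \alpha_1 L}X_t + \frac{\gamma}{1 - \alpha_2 L}Z_t + u_t$$

which gives

$$Y_t = \mu^* + (\alpha_1 + \alpha_2)Y_{t-1} - \alpha_1\alpha_2 Y_{t-2} + \beta X_t - \alpha_2\beta X_{t-1} + \gamma Z_t - \alpha_1\gamma Z_{t-1} + v_t \tag{9-14}$$

where

$$\mu^* = \mu(1 - \alpha_1)(1 - \alpha_2)$$

and

$$v_t = u_t - (\alpha_1 + \alpha_2)u_{t-1} + \alpha_1\alpha_2 u_{t-2}$$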

† L. M. Koyck, Distributed Lags and Investment Analysis, North-Holland, Amsterdam, 1954.
so that, compared with the single-variable Koyck scheme in Eq. (9-11), we have
two lagged values of $Y$ and lagged values of each explanatory variable. For
estimation purposes the essential point to notice about Koyck schemes is that
they may be formulated either with only lagged values of explanatory variables on
the right-hand side, as in Eqs. (9-10) and (9-13), or with lagged $Y$’s appearing on
the right-hand side, as in Eqs. (9-11) and (9-14). The former have nonlinear
restrictions on the parameters combined with presumably “well-behaved” disturbance
terms, while the latter have a dramatic reduction in the number of
right-hand side variables, but “complicated” disturbance terms and sometimes
restrictions on the coefficients [as in Eq. (9-14) but not in Eq. (9-11)].


Adaptive Expectations 


Lagged dependent variables may also appear among the regressors in various
expectational models. A firm may base its production rate $Y_t$ not on the current
sales rate $X_t$, but on the expected, permanent, or trend sales rate $X_t^*$. Thus one
may specify

$$Y_t = \alpha + \beta X_t^* + u_t \tag{9-15}$$

where a disturbance $u_t$ has been included to allow accidental over- or under-
achievement of the production target. Equation (9-15) is not usually statistically
operational since there is a dearth of published information on expected or
forecast sales rates and similar variables. It is therefore customary to add an
auxiliary hypothesis about the formation of expectations, and one of the most
widely used (if not, indeed, abused) schemes is that of adaptive expectations,
which is that expectations get updated each period on the basis of the latest
information about the actual value of the variable. The formal specification is

$$X_t^* - X_{t-1}^* = (1 - \lambda)(X_t - X_{t-1}^*) \qquad 0 \le \lambda < 1 \tag{9-16}$$

In this formulation $X_t^*$ indicates the expectation formed at the end of period $t$,
when the information about the current level $X_t$ has become available. If expectations
were formed at the beginning of the period, $X_t$ in Eq. (9-16) should be
replaced by $X_{t-1}$. If $\lambda = 0$ in Eq. (9-16), the expected value adjusts period by
period to the current observation and all previous history is irrelevant. If $\lambda = 1$,
an expectation, once formed, continues unchanged, irrespective of current or
earlier observations. The intermediate and more realistic case of $\lambda$ being a positive
fraction means that expectations get adjusted each period by some proportion of
the discrepancy between the latest observation and the expectation for that
period. Low values of $\lambda$ imply substantial adjustments in expectations, and large
values imply slowly changing expectations.

Equation (9-16) may be reformulated as

$$(1 - \lambda L)X_t^* = (1 - \lambda)X_t$$

or

$$X_t^* = \frac{1 - \lambda}{1 - \lambda L}X_t \tag{9-17}$$

This in turn may be written

$$X_t^* = (1 - \lambda)X_t + \lambda(1 - \lambda)X_{t-1} + \lambda^2(1 - \lambda)X_{t-2} + \cdots$$

so that the adaptive expectations hypothesis gives the current expectation as a
Koyck-weighted combination of the current and all previously observed values of
the variable in question. Substitution of Eq. (9-17) in Eq. (9-15) then gives

$$Y_t = \alpha + \frac{\beta(1 - \lambda)}{1 - \lambda L}X_t + u_t$$

or

$$Y_t = \alpha(1 - \lambda) + \lambda Y_{t-1} + \beta(1 - \lambda)X_t + (u_t - \lambda u_{t-1}) \tag{9-18}$$

This equation is formally identical to the simple Koyck scheme in Eq. (9-11) in
terms of the variables included, the MA(1) disturbance process, and the fact that
the parameter of the MA(1) process is also the coefficient of the lagged dependent
variable.
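The updating rule of Eq. (9-16) and its Koyck-weighted form can be compared directly. In the sketch below the function name, the observations, the value of $\lambda$, and the starting expectation are all hypothetical; a starting expectation of zero is assumed so that the weighted sum over the observed sample reproduces the recursion exactly.

```python
import numpy as np


def adaptive_expectations(x, lam, start):
    """Recursion X*_t = X*_{t-1} + (1 - lam) * (X_t - X*_{t-1}) from Eq. (9-16)."""
    x_star = np.empty(len(x))
    prev = start
    for t, x_t in enumerate(x):
        prev = prev + (1 - lam) * (x_t - prev)
        x_star[t] = prev
    return x_star


x = np.array([10.0, 12.0, 11.0, 15.0, 14.0])      # illustrative observations
lam = 0.6
x_star = adaptive_expectations(x, lam, start=0.0)

# Koyck-weighted form for the last period, with the pre-sample expectation set to zero.
t = len(x) - 1
weights = (1 - lam) * lam ** np.arange(t + 1)     # weights on x_t, x_{t-1}, ..., x_0
x_star_last = weights @ x[::-1]
```

Calling the function with `lam=0.0` returns the observations themselves, and with `lam=1.0` it returns the starting expectation unchanged, matching the two limiting cases described in the text.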


Partial Adjustment 

Another process which can generate lagged dependent variables among the
regressors is that of partial adjustment. Consider the adjustment of gasoline
consumption to a substantial price rise such as that engineered by OPEC in
1973/1974. Initially the scope for economies in consumption, even in the face of
very substantial price rises, was limited by such factors as


1. The existing geographical distribution of residences and work places
2. The existing stock of vehicles
3. The existing supply of alternative transport systems


In the short run, economies could be made in shopping and vacation trips,
car pooling on work trips, and so forth. In the longer run, one expects adjustments
in the more fundamental factors, such as the fuel efficiency of the vehicle
fleet. Such adjustment has its own costs and, in any case, must take time to be
achieved. Thus one may postulate $Y_t^*$, the optimal consumption rate appropriate
to a gasoline price of $X_t$, with income and other factors being held constant, as

$$Y_t^* = \alpha + \beta X_t \tag{9-19}$$

For reasons such as those suggested one would not expect actual consumption $Y_t$
to adjust completely to $X_t$ in period $t$. Instead, a partial adjustment process is
frequently specified as

$$Y_t - Y_{t-1} = (1 - \lambda)(Y_t^* - Y_{t-1}) + u_t \qquad 0 \le \lambda \le 1 \tag{9-20}$$

Notice that no disturbance term has been inserted in the calculation of the
optimal $Y_t^*$, but it would seem essential to include one in the specification of the
actual $Y_t$ in Eq. (9-20). An alternative form of Eq. (9-20) is

$$(1 - \lambda L)Y_t = (1 - \lambda)Y_t^* + u_t \tag{9-21}$$

which, in turn, gives

$$Y_t = (1 - \lambda)Y_t^* + \lambda(1 - \lambda)Y_{t-1}^* + \lambda^2(1 - \lambda)Y_{t-2}^* + \cdots + (u_t + \lambda u_{t-1} + \lambda^2 u_{t-2} + \cdots)$$

so that the current consumption rate is a Koyck-weighted combination of current
and all previously desired rates. Substitution of Eq. (9-19) into Eq. (9-21) gives

$$(1 - \lambda L)Y_t = \alpha(1 - \lambda) + \beta(1 - \lambda)X_t + u_t$$

or

$$Y_t = \alpha(1 - \lambda) + \lambda Y_{t-1} + \beta(1 - \lambda)X_t + u_t \tag{9-22}$$

Again notice the formal equivalence of Eq. (9-22) to the adaptive expectations
equation (9-18) and the Koyck-weighted lag scheme in Eq. (9-11). The only
difference is that in Eq. (9-22) the disturbance term may have simpler properties
than in the other two cases.


Demand functions are frequently specified in constant elasticity form. Thus
Eq. (9-19) could be respecified as

$$Y_t^* = AX_t^{\beta} \tag{9-23}$$

The partial adjustment process would then have to be specified conformably as

$$\frac{Y_t}{Y_{t-1}} = \left(\frac{Y_t^*}{Y_{t-1}}\right)^{1-\lambda} e^{u_t} \tag{9-24}$$

Combining the two relations produces

$$Y_t = A^{1-\lambda}\, Y_{t-1}^{\lambda}\, X_t^{\beta(1-\lambda)}\, e^{u_t}$$

which, using lowercase letters to denote natural logarithms, gives

$$y_t = a(1 - \lambda) + \lambda y_{t-1} + \beta(1 - \lambda)x_t + u_t \tag{9-25}$$

Since Eq. (9-25) is double logarithmic, the coefficients represent elasticities. Thus
the short-run (or impact) elasticity of $Y$ with respect to $X$ is $\beta(1 - \lambda)$, while the
long-run (or full adjustment) elasticity is seen from Eq. (9-23) to be $\beta$. If the
estimated form of Eq. (9-25) is denoted by

$$\hat{y}_t = c_0 + c_1 y_{t-1} + c_2 x_t$$

then

Estimated short-run elasticity = $c_2$

Estimated adjustment parameter $\hat{\lambda} = c_1$

Estimated long-run elasticity = $\dfrac{c_2}{1 - c_1}$
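Recovering the structural quantities from the estimated coefficients is a one-line calculation each. The sketch below is only an illustration; the function name and the coefficient values are hypothetical and do not come from any estimated equation.

```python
def partial_adjustment_elasticities(c1, c2):
    """Map estimated coefficients of Eq. (9-25) into (short-run, lambda, long-run)."""
    short_run = c2                 # estimate of beta * (1 - lambda)
    lam = c1                       # coefficient on the lagged dependent variable
    long_run = c2 / (1.0 - c1)     # estimate of beta
    return short_run, lam, long_run


# Hypothetical estimates for a gasoline demand equation in logarithms.
short_run, lam, long_run = partial_adjustment_elasticities(c1=0.6, c2=-0.1)
```

With these made-up numbers the long-run elasticity is two and a half times the short-run elasticity, since $1/(1 - 0.6) = 2.5$.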


The partial adjustment process specified in Eq. (9-20) has been widely used in 
applied work because of the simplicity of the resultant estimating equation, such 
as Eq. (9-25). Nonetheless it implies a pattern of adjustment that may sometimes 
be implausible. Suppose ¥ had been constant at X sufficiently long for Y to have 
settled at the desired level, ¥Y = a + BX. In period ¢ we assume X to become 
X +4X and then to remain at the new level indefinitely. The new desired Y is 
given by Y= a + B(X +AX), and the adjustment to that level implied by Eq. 
(9-20) for a A value of, say, 0.5 and a negative B is shown in Fig. 9-1. 

In the first period one-half of the total desired adjustment is achieved; in the 
second period one-half of the remaining adjustment is accomplished, and so 
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Y 

n 

¥ 

| ee ae ea 

oO n ate in 1 s aS”, 
i=l t t tal Apt tai 4% 

Figure 9-1 


forth. Thus the maximum adjustment is achieved in the first period, and each 
successive adjustment is a fraction d of the previous adjustment. This might be a 
plausible reaction pattern for, say, the consumption of broiler chickens in 
response to a significant price change, but it is less plausible for the consumption 
of gasoline since that consumption is mediated through durable equipment. 
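
The geometric pattern of adjustment can be traced numerically. The following is a minimal sketch, assuming that the deterministic part of the partial adjustment relation takes the form $Y_t = (1-\lambda)Y^* + \lambda Y_{t-1}$ (the form implied by the estimating equation above); the function name `adjustment_path` and the starting and target levels are hypothetical and chosen only for illustration.

```python
def adjustment_path(y_start, y_target, lam, periods):
    """Trace Y under the assumed rule Y_t = (1 - lam) * Y_target + lam * Y_{t-1}."""
    path = [y_start]
    for _ in range(periods):
        path.append((1.0 - lam) * y_target + lam * path[-1])
    return path


# Hypothetical levels: Y starts at 10 and the new desired level is 6 (negative response).
path = adjustment_path(y_start=10.0, y_target=6.0, lam=0.5, periods=4)
steps = [later - earlier for earlier, later in zip(path, path[1:])]

print(path)   # [10.0, 8.0, 7.0, 6.5, 6.25]
print(steps)  # [-2.0, -1.0, -0.5, -0.25]
```

Each step is $\lambda = 0.5$ times the previous one, so the largest movement always occurs in the first period, which is the feature questioned in the text for goods whose consumption depends on durable equipment.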

A further difficulty with the simple partial adjustment process arises when Y* 
is a function of more than one explanatory variable. Suppose, for example, the 
optimal level of energy consumption depends on both the relative price of energy 
and the level of output in the economy. Applying the partial adjustment process 
to actual energy demand imposes the same adjustment parameter on each 
explanatory variable. Even if the form of the adjustment process is similar for 
each variable, the speed of the process may well be different. Thus at given prices, 
one might expect energy consumption to move more or less in step with output, 
but to react much more slowly to price changes. 


Combination of Adaptive Expectations and Partial Adjustment 


Suppose X* represents “permanent” or long-run income, and Y* the corresponding
level of “permanent” or long-run consumption.† One might then write

$$Y^* = \alpha + \beta X^* \tag{9-26}$$

This is not an operational equation since there are no direct observations on the
variables. However, the adaptive expectations hypothesis may be used to explain
X* and partial adjustment to explain the adjustment of Y to Y*. Thus combining
Eqs. (9-17) and (9-21) with Eq. (9-26) and allowing the $\lambda$ parameter to be different
in the two processes gives

$$
\begin{aligned}
(1-\lambda_1 L)Y_t &= (1-\lambda_1)Y_t^* + u_t \\
&= \alpha(1-\lambda_1) + \beta(1-\lambda_1)X_t^* + u_t \\
&= \alpha(1-\lambda_1) + \frac{\beta(1-\lambda_1)(1-\lambda_2)}{1-\lambda_2 L}X_t + u_t
\end{aligned}
\tag{9-27}
$$

or

$$
\begin{aligned}
Y_t = {} & \alpha(1-\lambda_1)(1-\lambda_2) + (\lambda_1+\lambda_2)Y_{t-1} - \lambda_1\lambda_2 Y_{t-2} \\
& + \beta(1-\lambda_1)(1-\lambda_2)X_t + (u_t - \lambda_2 u_{t-1})
\end{aligned}
$$

†See M. Friedman, A Theory of the Consumption Function, Princeton University Press, Princeton,
NJ, 1957.


The parameters $\lambda_1$ and $\lambda_2$ appear symmetrically in the systematic part of Eq.
(9-27). Thus if one ignores the structure of the disturbance term and runs a
regression of $Y_t$ on $Y_{t-1}$, $Y_{t-2}$, and $X_t$, the resultant coefficients would not yield
estimates of the separate lag parameters $\lambda_1$ and $\lambda_2$. The sum $\lambda_1 + \lambda_2$ and the
product $\lambda_1\lambda_2$ can be estimated directly, and hence the term $(1-\lambda_1)(1-\lambda_2)$ is
estimable and so are $\alpha$ and $\beta$. However, taking account of the structure of the
disturbance term can lead to estimates of the $\lambda$'s, as will be shown in Sec. 9-2.
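
The symmetry can be made concrete with a small numerical sketch. It assumes one already has regression coefficients `c1`, `c2`, and `c3` on $Y_{t-1}$, $Y_{t-2}$, and $X_t$ (the names and the numerical values are hypothetical). The two lag parameters are then the roots of $z^2 - c_1 z - c_2 = 0$, but nothing in the systematic part says which root belongs to which process.

```python
import numpy as np


def lag_parameters(c1, c2):
    """Roots of z^2 - c1*z - c2 = 0, where c1 = lam1 + lam2 and c2 = -lam1*lam2.

    The pair is returned sorted; the labels lam1 and lam2 cannot be assigned.
    """
    roots = np.roots([1.0, -c1, -c2])
    return np.sort(np.real_if_close(roots))


def long_run_beta(c1, c2, c3):
    """beta = c3 / [(1 - lam1)(1 - lam2)], using (1 - lam1)(1 - lam2) = 1 - c1 - c2."""
    return c3 / (1.0 - c1 - c2)


# Hypothetical coefficients generated by lam1 = 0.3, lam2 = 0.6, beta = 2.
c1, c2, c3 = 0.9, -0.18, 2.0 * 0.7 * 0.4
lams = lag_parameters(c1, c2)
beta = long_run_beta(c1, c2, c3)
print(lams, beta)
```

Swapping the values of $\lambda_1$ and $\lambda_2$ leaves `c1`, `c2`, and `c3` unchanged, which is why only the disturbance structure can separate them.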


9-2 ESTIMATION METHODS 


Let us begin with the estimation of the distributed lag function (9-1), that is,

$$Y_t = \mu + \delta_0 X_t + \delta_1 X_{t-1} + \cdots + \delta_s X_{t-s} + u_t \tag{9-28}$$

where, for simplicity, we restrict consideration to the lagged values of a single
explanatory variable. We usually cannot expect theory to indicate the maximum
length of lag, but one would ordinarily expect significance tests on the $\delta$'s to give
some indication both of the maximum lag length and of any delay in the initial
transmission of an effect from X to Y. The validity of such significance tests
depends on the properties of the disturbance process $\{u_t\}$ and the associated
estimation methods. If $E(u) = 0$ and $\operatorname{var}(u) = \sigma^2 I$, then, in principle, OLS would
be an appropriate estimation technique. In practice, however, its application is
likely to be plagued by collinearity between the regressors, leading to great
imprecision in the estimates of the $\delta$'s.


Almon Lags 


A general strategy for dealing with this collinearity and the associated imprecision
is to reduce the number of parameters to be estimated by the assumption of some
pattern for the $\delta$'s. The Koyck scheme of Sec. 9-1 is perhaps an extreme example
of such a pattern. The Almon lag scheme provides a more flexible method for
reduced parameterization.† Under the Almon scheme one rules out the direct
approach of attempting to estimate all (s + 1) $\delta$'s and assumes instead that the
$\delta$'s can be approximated by some function $\delta_i = f(i)$, as in Fig. 9-2b. The basis of
the approximation is Weierstrass's theorem, which states that a function continuous
in a closed interval may be approximated over the whole interval by a
polynomial of suitable degree, which differs from the function by less than any
given positive quantity at every point of the interval.‡

[Figure 9-2: panels (a) and (b)]

†S. Almon, “The Distributed Lag between Capital Appropriations and Expenditures,”
Econometrica, vol. 30, 1962, pp. 407-423.

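As an illustration suppose we postulate a third-degree polynomial, that is,

$$f(i) = \alpha_0 + \alpha_1 i + \alpha_2 i^2 + \alpha_3 i^3$$

Then approximately

$$
\begin{aligned}
\delta_0 &= f(0) = \alpha_0 \\
\delta_1 &= f(1) = \alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 \\
\delta_2 &= f(2) = \alpha_0 + 2\alpha_1 + 4\alpha_2 + 8\alpha_3 \\
\delta_3 &= f(3) = \alpha_0 + 3\alpha_1 + 9\alpha_2 + 27\alpha_3 \\
&\;\;\vdots \\
\delta_s &= f(s) = \alpha_0 + s\alpha_1 + s^2\alpha_2 + s^3\alpha_3
\end{aligned}
\tag{9-29}
$$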
‡R. Courant, Differential and Integral Calculus, vol. 1, 2d edition, Blackie & Son, Glasgow, United
Kingdom, 1937, p. 423.

Substituting Eq. (9-29) in Eq. (9-28) and rearranging gives

$$
\begin{aligned}
Y_t = \mu &+ \alpha_0(X_t + X_{t-1} + X_{t-2} + X_{t-3} + \cdots + X_{t-s}) \\
&+ \alpha_1(X_{t-1} + 2X_{t-2} + 3X_{t-3} + \cdots + sX_{t-s}) \\
&+ \alpha_2(X_{t-1} + 4X_{t-2} + 9X_{t-3} + \cdots + s^2X_{t-s}) \\
&+ \alpha_3(X_{t-1} + 8X_{t-2} + 27X_{t-3} + \cdots + s^3X_{t-s}) + u_t
\end{aligned}
\tag{9-30}
$$

Thus four new regressors are formed as linear combinations of the lagged X's.
The regression of Y on these variables yields estimates of the $\alpha$'s, which in turn
yield estimates of the $\delta$'s from Eq. (9-29). The sampling variances and covariances
of the $\hat\delta$'s can be computed from those of the $\hat\alpha$'s and significance tests carried out
on the $\delta$'s. Defining $W_3$ as the matrix of coefficients in Eq. (9-29),


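$$
W_3 = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 2 & 4 & 8 \\
\vdots & \vdots & \vdots & \vdots \\
1 & s & s^2 & s^3
\end{bmatrix}
$$

where the subscript 3 indicates the use of a third-degree approximating polynomial.
Equation (9-29) then becomes

$$\delta = W_3\alpha \tag{9-31}$$

and, given $\hat\alpha$,

$$\hat\delta = W_3\hat\alpha \tag{9-32}$$

The matrix form of the original equation (9-28) is

$$y = i\mu + X\delta + u$$

Using Eq. (9-31),

$$y = i\mu + XW_3\alpha + u$$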


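An OLS regression of y on $[\,i \;\; XW_3\,]$, where $XW_3$ is the matrix of observations on
the “new” regressors shown explicitly in Eq. (9-30), gives the estimated coefficients

$$
\begin{bmatrix} \hat\mu \\ \hat\alpha \end{bmatrix}
=
\begin{bmatrix} i'i & i'XW_3 \\ W_3'X'i & W_3'X'XW_3 \end{bmatrix}^{-1}
\begin{bmatrix} i'y \\ W_3'X'y \end{bmatrix}
$$

with

$$\operatorname{var}(\hat\alpha) = \sigma^2\left[W_3'X'XW_3 - \frac{1}{n}W_3'X'ii'XW_3\right]^{-1} \tag{9-33}$$

From Eqs. (9-31) and (9-32)

$$E(\hat\delta) = W_3E(\hat\alpha) = \delta$$

and

$$\operatorname{var}(\hat\delta) = W_3 \cdot \operatorname{var}(\hat\alpha) \cdot W_3' \tag{9-34}$$

Substitution of Eq. (9-33) in Eq. (9-34) then gives the matrix of sampling
variances and covariances for the $\hat\delta$'s.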
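
The steps in Eqs. (9-29) to (9-34) can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the helper names `almon_W`, `lag_matrix`, and `almon_fit` are hypothetical, the data are simulated, and the degree and lag length are taken as known.

```python
import numpy as np


def almon_W(s, r):
    """(s+1) x (r+1) matrix with rows (1, i, i^2, ..., i^r) for i = 0..s."""
    return np.vander(np.arange(s + 1), r + 1, increasing=True).astype(float)


def lag_matrix(x, s):
    """Rows (X_t, X_{t-1}, ..., X_{t-s}) for every t that has s lags available."""
    n = len(x)
    return np.column_stack([x[s - i : n - i] for i in range(s + 1)])


def almon_fit(y, x, s, r):
    """OLS of y on [i, XW], then delta = W alpha and var(delta) = W var(alpha) W'."""
    X, W = lag_matrix(x, s), almon_W(s, r)
    A = np.column_stack([np.ones(len(X)), X @ W])      # [i  XW]
    target = y[s:]
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    sigma2 = resid @ resid / (len(target) - A.shape[1])
    cov = sigma2 * np.linalg.inv(A.T @ A)
    alpha, cov_alpha = coef[1:], cov[1:, 1:]            # block matching Eq. (9-33)
    return coef[0], alpha, W @ alpha, W @ cov_alpha @ W.T


# Simulated example: the true delta's lie exactly on a cubic (hypothetical values).
rng = np.random.default_rng(0)
s, r, n = 6, 3, 400
delta_true = almon_W(s, r) @ np.array([0.2, 0.9, -0.3, 0.025])
x = rng.normal(size=n)
y = np.full(n, np.nan)
y[s:] = 1.5 + lag_matrix(x, s) @ delta_true + 0.1 * rng.normal(size=n - s)

mu_hat, alpha_hat, delta_hat, cov_delta = almon_fit(y, x, s, r)
print(np.round(delta_hat, 3))
print(np.round(np.sqrt(np.diag(cov_delta)), 4))
```

Only four slope coefficients are estimated, yet all seven lag weights and their standard errors are recovered through $W_3$.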

The above would be very useful if, in fact, one knew the appropriate degree 
for the approximating polynomial. In practice the determination of that degree is 
an important problem, even given an assumption about the maximum lag length. 
The problem may be approached in two ways. From Eq. (9-30) it is seen that the
coefficient of the last “new” regressor, $\alpha_3$, is the coefficient of the highest power in
the approximating polynomial. Testing the significance of $\alpha_3$ is, in effect, asking
whether we need a third-degree polynomial. However, finding $\alpha_3$ insignificant
does not necessarily imply that higher-order $\alpha$'s would also be found insignificant.
The recommended procedure would be to start with a fairly high degree of
polynomial, say, fourth or fifth, test the last coefficient for significance, and keep 
reducing the degree of the polynomial until the last coefficient is found significant. 
The disadvantage of this procedure is that, in order to carry out the tests, various 
“new” regressors have to be computed which may not in fact be required in the 
final regression. 

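The second approach avoids this computational difficulty. Tests of the degree
of the approximating polynomial can be based on the unrestricted OLS estimates
of Eq. (9-28) for some assumed value of s, and once the degree has been
determined, the Almon estimators can be found by an application of restricted
OLS estimation. Consider again a third-degree approximation given by

$$\delta_i = \alpha_0 + \alpha_1 i + \alpha_2 i^2 + \alpha_3 i^3$$

Taking the first difference of this function gives a polynomial of the second
degree, and so on, for each successive difference until†

$$\Delta^4\delta_i = 0$$

But

$$
\begin{aligned}
\Delta\delta_i &= \delta_i - \delta_{i-1} \\
\Delta^2\delta_i &= (\delta_i - \delta_{i-1}) - (\delta_{i-1} - \delta_{i-2}) \\
&= \delta_i - 2\delta_{i-1} + \delta_{i-2} \\
\Delta^3\delta_i &= \delta_i - 3\delta_{i-1} + 3\delta_{i-2} - \delta_{i-3} \\
\Delta^4\delta_i &= \delta_i - 4\delta_{i-1} + 6\delta_{i-2} - 4\delta_{i-3} + \delta_{i-4}
\end{aligned}
$$

Thus the assumption of a third-degree polynomial places a set of linear restrictions
on the $\delta$'s. The full set of restrictions is

$$
\begin{aligned}
\delta_4 - 4\delta_3 + 6\delta_2 - 4\delta_1 + \delta_0 &= 0 \\
\delta_5 - 4\delta_4 + 6\delta_3 - 4\delta_2 + \delta_1 &= 0 \\
&\;\;\vdots \\
\delta_s - 4\delta_{s-1} + 6\delta_{s-2} - 4\delta_{s-3} + \delta_{s-4} &= 0
\end{aligned}
\tag{9-35}
$$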


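†
$$
\begin{aligned}
\Delta\delta_i &= \delta_i - \delta_{i-1} \\
&= \alpha_0 + \alpha_1 i + \alpha_2 i^2 + \alpha_3 i^3 - \alpha_0 - \alpha_1(i-1) - \alpha_2(i-1)^2 - \alpha_3(i-1)^3 \\
&= (\alpha_1 - \alpha_2 + \alpha_3) + (2\alpha_2 - 3\alpha_3)i + 3\alpha_3 i^2
\end{aligned}
$$

The second difference of $\delta_i$ is found by repeating the first difference operation. Thus

$$
\begin{aligned}
\Delta^2\delta_i &= (2\alpha_2 - 3\alpha_3)i + 3\alpha_3 i^2 - (2\alpha_2 - 3\alpha_3)(i-1) - 3\alpha_3(i-1)^2 \\
&= (2\alpha_2 - 6\alpha_3) + 6\alpha_3 i
\end{aligned}
$$

The degree of the polynomial in i decreases by 1 with each differencing. The third and fourth
differences are then

$$\Delta^3\delta_i = 6\alpha_3 \qquad\text{and}\qquad \Delta^4\delta_i = 0$$

No restriction involves the intercept term $\mu$. Thus the restrictions (9-35) may be
expressed as

$$R_3\begin{bmatrix}\mu \\ \delta\end{bmatrix} = 0 \tag{9-36}$$

where $R_3$ is the $(s-3) \times (s+2)$ matrix

$$
R_3 = \begin{bmatrix}
0 & 1 & -4 & 6 & -4 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & -4 & 6 & -4 & 1 & \cdots & 0 \\
\vdots & & & & & & & & \vdots \\
0 & 0 & \cdots & 0 & 1 & -4 & 6 & -4 & 1
\end{bmatrix}
$$

A second-degree approximating polynomial would imply the set of s − 2 linear
restrictions given by

$$R_2\begin{bmatrix}\mu \\ \delta\end{bmatrix} = 0$$

where $R_2$ is the $(s-2) \times (s+2)$ matrix

$$
R_2 = \begin{bmatrix}
0 & -1 & 3 & -3 & 1 & 0 & \cdots & 0 \\
0 & 0 & -1 & 3 & -3 & 1 & \cdots & 0 \\
\vdots & & & & & & & \vdots \\
0 & 0 & \cdots & 0 & -1 & 3 & -3 & 1
\end{bmatrix}
\tag{9-37}
$$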


Notice that the nonzero elements in the rows of the R matrices are given by the
appropriate set of binomial coefficients with alternating signs.† If r denotes
the degree of the approximating polynomial, the nonzero elements in $R_r$ are the
coefficients of L in the polynomial $(1 - L)^{r+1}$, but in reverse order. However,
since the restriction sets linear combinations of the $\delta$'s equal to zero, we can
multiply the rows of $R_r$ by −1 and get the coefficients in natural order.

For a given maximum lag s the sequential procedure for finding a suitable 
degree of the approximating polynomial would be as follows. 


1. Start with a polynomial of fairly high degree, say, the fourth or fifth.
2. Set out the corresponding R matrix and test the null hypothesis

$$H_0\colon\; R\begin{bmatrix}\mu \\ \delta\end{bmatrix} = 0$$

by substituting in Eq. (5-68) the results of the unrestricted OLS estimation of

$$Y_t = \mu + \delta_0 X_t + \delta_1 X_{t-1} + \cdots + \delta_s X_{t-s} + u_t$$

Under the null hypothesis the resultant test statistic has the F(s − r, n − s − 2)
distribution, where r is the degree of the approximating polynomial.

3. If the null hypothesis is rejected, the initial polynomial has not been of
sufficiently high degree.

4. If the null hypothesis is accepted, proceed to the next lower degree and test
the new set of linear restrictions, proceeding in this way until the null
hypothesis is rejected.


† The binomial coefficients may be simply obtained from Pascal's triangle

$$
\begin{array}{cl}
1 & \text{Almon polynomial} \\
1 \quad 1 & \\
1 \quad 2 \quad 1 & \\
1 \quad 3 \quad 3 \quad 1 & \text{second degree} \\
1 \quad 4 \quad 6 \quad 4 \quad 1 & \text{third degree}
\end{array}
$$

where an internal element in any row is the sum of the pair of elements immediately above.


If the null hypothesis is accepted, say, for $R_3$ but rejected for $R_2$, the
appropriate procedure is to fit a third-degree approximating polynomial. This
may be done by computing the four “new” regressors specified in Eq. (9-30),
estimating the $\alpha$'s by OLS and then using Eq. (9-32) to estimate the $\delta$'s.
Alternatively one may use the formula for the restricted estimator given in Eq.
(6-5), and inferences may be made by using the variance matrix given in the
footnote to Eq. (6-5).

The above procedure is conditional on some assumed value for the maximum
lag s. It may be repeated for various values of s and a judgment made by looking
at the overall fit and the significance of the higher-order $\delta$'s.
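
The sequential procedure can be sketched as follows. The sketch is illustrative only: the function names `restriction_matrix` and `f_statistic` are hypothetical, the data are simulated, and the statistic is the standard F statistic for a set of linear restrictions computed from the unrestricted OLS estimates, which is assumed here to be what Eq. (5-68) provides.

```python
import numpy as np
from math import comb


def restriction_matrix(s, r):
    """(s - r) x (s + 2) matrix R_r acting on (mu, delta_0, ..., delta_s)."""
    # Coefficients of the (r+1)th difference, ordered from delta_{i-r-1} up to delta_i.
    coeffs = [(-1) ** (r + 1 - j) * comb(r + 1, j) for j in range(r + 2)]
    R = np.zeros((s - r, s + 2))
    for row in range(s - r):
        R[row, 1 + row : 1 + row + r + 2] = coeffs   # column 0 (intercept) stays zero
    return R


def f_statistic(Z, y, R):
    """F statistic for H0: R b = 0, with (q, n - k) degrees of freedom."""
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ b
    n, k = Z.shape
    q = R.shape[0]
    Rb = R @ b
    middle = R @ np.linalg.inv(Z.T @ Z) @ R.T
    return (Rb @ np.linalg.solve(middle, Rb) / q) / (e @ e / (n - k))


# Simulated data in which the true lag weights lie on a quadratic (hypothetical values).
rng = np.random.default_rng(1)
s, n = 6, 300
x = rng.normal(size=n + s)
lags = np.column_stack([x[s - i : n + s - i] for i in range(s + 1)])
Z = np.column_stack([np.ones(n), lags])
i = np.arange(s + 1)
y = 2.0 + lags @ (1.0 + 0.5 * i - 0.08 * i**2) + 0.2 * rng.normal(size=n)

f_by_degree = {r: f_statistic(Z, y, restriction_matrix(s, r)) for r in (3, 2, 1)}
print(f_by_degree)
```

Working downward from r = 3, the statistic should stay small while the restrictions hold (degrees 3 and 2 here) and become large at degree 1, which points to a second-degree polynomial in this artificial example. Each statistic would be compared with the F(s − r, n − s − 2) critical value.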

An implication of the Almon procedure, which does not seem to have
attracted much attention, is that it is likely to yield biased and, indeed,
inconsistent estimates. Write the original model, Eq. (9-28), for simplicity as

$$y = X\delta + u \tag{9-38}$$

If the $\delta$'s do not lie exactly on the approximating polynomial, then a formula
such as Eq. (9-31) has to be amended to

$$\delta = W\alpha + v \tag{9-39}$$

where v is an $(s+1) \times 1$ vector of errors involved in the use of an rth-degree
approximating polynomial. Notice that v is independent of time and is a vector of
unknown constants, which does not vanish with increasing sample size. Substituting
Eq. (9-39) in Eq. (9-38) gives

$$y = XW\alpha + (Xv + u) \tag{9-40}$$

In Eq. (9-40) there is obviously some correlation between the explanatory
variables XW and the expanded disturbance term Xv + u, which would lead one
to expect inconsistency in the estimation of $\alpha$ and hence of $\delta$. Looking directly at
the estimator of $\delta$,

$$
\begin{aligned}
\hat\delta &= W\hat\alpha \\
&= W(W'X'XW)^{-1}W'X'y \\
&= W\left[W'\left(\frac{1}{n}X'X\right)W\right]^{-1}W'\left\{\left(\frac{1}{n}X'X\right)\delta + \frac{1}{n}X'u\right\}
\end{aligned}
$$

Assuming

$$\operatorname{plim}\left(\frac{1}{n}X'X\right) = \Sigma_{xx} \qquad\text{and}\qquad \operatorname{plim}\left(\frac{1}{n}X'u\right) = 0$$

we have

$$\operatorname{plim}\hat\delta = W[W'\Sigma_{xx}W]^{-1}W'\Sigma_{xx}\delta \tag{9-41}$$

Substitution of Eq. (9-39) in Eq. (9-41) gives

$$\operatorname{plim}\hat\delta = W\alpha + W[W'\Sigma_{xx}W]^{-1}W'\Sigma_{xx}v = \delta - \left\{I - W[W'\Sigma_{xx}W]^{-1}W'\Sigma_{xx}\right\}v \tag{9-42}$$

so that the Almon estimator is inconsistent unless the unknown $\delta$'s lie exactly on
the chosen polynomial, in which case v = 0. The finite sample bias of the Almon
estimator can be serious if one fits a polynomial of too low degree. This bias,
combined with the smaller sampling variation (as compared with unrestricted
OLS), can sometimes give sampling distributions for the Almon estimators which
fail to contain the true $\delta$ parameter altogether or else have it located near an
extremity of the distribution.
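
The probability limit in Eq. (9-41) is easy to evaluate for a given moment matrix. The sketch below is illustrative: the function name `almon_plim`, the moment matrix, and the two sets of lag weights are hypothetical choices, with a first-degree polynomial fitted to five lag weights.

```python
import numpy as np


def almon_plim(delta, W, Sigma_xx):
    """Right-hand side of Eq. (9-41): W [W' S W]^{-1} W' S delta."""
    WtS = W.T @ Sigma_xx
    return W @ np.linalg.solve(WtS @ W, WtS @ delta)


s, r = 4, 1
idx = np.arange(s + 1)
W = np.vander(idx, r + 1, increasing=True).astype(float)
Sigma_xx = 0.8 ** np.abs(np.subtract.outer(idx, idx))   # assumed moment matrix of the lagged X's

on_line = W @ np.array([1.0, -0.2])           # weights exactly on a straight line
hump = np.array([0.0, 1.0, 2.0, 1.0, 0.0])    # inverted-V weights, not on any line

print(almon_plim(on_line, W, Sigma_xx))   # reproduces on_line
print(almon_plim(hump, W, Sigma_xx))      # a straight line, so it cannot equal hump
```

When the weights lie on the polynomial the limit reproduces them exactly; otherwise the limit is the polynomial closest to $\delta$ in the metric defined by $\Sigma_{xx}$, and no amount of data removes the discrepancy.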

Computer packages with Almon lag estimators usually offer the facility of
including end-point restrictions such as $\delta_{-1} = 0$ and/or $\delta_{s+1} = 0$. Since $\delta_{-1}$ is the
notional coefficient of $X_{t+1}$, and that variable has no effect on $Y_t$, it might seem
sensible to incorporate that end-point constraint. As Dhrymes and Schmidt and
Waud have pointed out, that is a fallacious argument.† Setting $\delta_{-1} = 0$ implies a
restriction on the $\alpha$'s and hence on the $\delta$'s, which in turn is a restriction on how
$X_t, X_{t-1}, \ldots, X_{t-s}$ affect $Y_t$. For a second-order polynomial the implied restriction
is

$$\alpha_0 - \alpha_1 + \alpha_2 = 0$$

Such a restriction could, of course, be tested by estimating the $\alpha$'s and using the
variance matrix in Eq. (9-33). The purpose of the Almon polynomial is to give a
good approximation to the unknown $\delta$'s over the interval 0 to s. Its behavior if
extrapolated outside that interval is irrelevant. The second end-point restriction,
$\delta_{s+1} = 0$, may not produce much distortion in the approximation if the coefficients
are decaying with increasing lags but, again, it implies a restriction on the $\alpha$'s and
$\delta$'s, and there seems little valid reason for imposing it.


†P. J. Dhrymes, Distributed Lags: Problems of Estimation and Formulation, Holden-Day, San
Francisco, 1971, pp. 232-234; P. Schmidt and R. N. Waud, “Almon Lag Technique and the Monetary
versus Fiscal Policy Debate,” Journal of the American Statistical Association, vol. 68, 1973, pp. 11-19.


Direct Estimation of a Koyck Lag
If one assumes Eq. (9-28) to obey a simple Koyck lag, the relation becomes

$$Y_t = \mu + \delta X_t + \alpha\delta X_{t-1} + \alpha^2\delta X_{t-2} + \cdots + u_t \qquad |\alpha| < 1 \tag{9-43}$$

with four parameters to be estimated, namely $\mu$, $\delta$, $\alpha$, and $\sigma^2$. The lag is now
infinite, but the coefficients decay exponentially. The relation (9-43) may be
rewritten as

$$Y_t = \mu + \delta(X_t + \alpha X_{t-1} + \cdots + \alpha^{t-1}X_1) + \alpha^t\delta(X_0 + \alpha X_{-1} + \alpha^2 X_{-2} + \cdots) + u_t$$

or

$$Y_t = \mu + \delta X_t^* + \alpha^t\gamma + u_t \tag{9-44}$$

where

$$X_t^* = X_t + \alpha X_{t-1} + \cdots + \alpha^{t-1}X_1$$

and

$$\gamma = \delta\sum_{i=0}^{\infty}\alpha^i X_{-i} = E(Y_0 - \mu)$$

The $\gamma$ parameter may be regarded as the expected difference between Y and $\mu$ in
the period preceding the first sample observation. If $u \sim N(0, \sigma^2 I)$, the application
of OLS to Eq. (9-44) would yield ML estimates. The matrix formulation of
Eq. (9-44) would be

$$
y = \begin{bmatrix}
1 & X_1^* & \alpha \\
1 & X_2^* & \alpha^2 \\
\vdots & \vdots & \vdots \\
1 & X_n^* & \alpha^n
\end{bmatrix}
\begin{bmatrix}\mu \\ \delta \\ \gamma\end{bmatrix} + u
\tag{9-45}
$$

where the X*'s may be built up recursively as

$$
\begin{aligned}
X_1^* &= X_1 \\
X_2^* &= X_2 + \alpha X_1 = X_2 + \alpha X_1^* \\
X_3^* &= X_3 + \alpha X_2 + \alpha^2 X_1 = X_3 + \alpha X_2^*
\end{aligned}
$$
Since two columns of the data matrix depend on the unknown $\alpha$, one can proceed
with a grid search over the interval $0 < \alpha < 1$. For each specified value of $\alpha$ the
data matrix in Eq. (9-45) is computed and OLS applied, the final choice of
regression being based on the minimum residual sum of squares. The standard
errors for $\hat\mu$ and $\hat\delta$ from the OLS program would only be correct if $\alpha$ were known
exactly, which is not the case. The asymptotic standard errors can be obtained
from the information matrix, which is†

$$
R\begin{pmatrix}\mu \\ \delta \\ \gamma \\ \alpha\end{pmatrix}
= \frac{1}{\sigma^2}
\begin{bmatrix}
n & \sum X_t^* & \sum\alpha^t & \sum\left(\delta\frac{\partial X_t^*}{\partial\alpha} + t\alpha^{t-1}\gamma\right) \\
 & \sum X_t^{*2} & \sum\alpha^t X_t^* & \sum\left(\delta\frac{\partial X_t^*}{\partial\alpha} + t\alpha^{t-1}\gamma\right)X_t^* \\
 & & \sum\alpha^{2t} & \sum\left(\delta\frac{\partial X_t^*}{\partial\alpha} + t\alpha^{t-1}\gamma\right)\alpha^t \\
 & & & \sum\left(\delta\frac{\partial X_t^*}{\partial\alpha} + t\alpha^{t-1}\gamma\right)^2
\end{bmatrix}
\tag{9-46}
$$

This matrix is symmetric and so only the upper triangular portion has been
shown. The unknown parameters in Eq. (9-46) would be replaced by their
estimated values, and the inverse would give the estimated variance matrix for the
parameters.

† See Problem 9-4.
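
The grid search can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `koyck_design` and `koyck_grid_search` are hypothetical, the data are simulated with a long pre-sample period standing in for the infinite past, and the grid spacing of 0.01 is arbitrary.

```python
import numpy as np


def koyck_design(x, alpha):
    """Data matrix of Eq. (9-45): columns 1, X*_t, alpha^t for t = 1..n."""
    n = len(x)
    xstar = np.empty(n)
    xstar[0] = x[0]
    for t in range(1, n):
        xstar[t] = x[t] + alpha * xstar[t - 1]      # X*_t = X_t + alpha X*_{t-1}
    return np.column_stack([np.ones(n), xstar, alpha ** np.arange(1, n + 1)])


def koyck_grid_search(y, x, grid):
    """Return (rss, alpha, [mu, delta, gamma]) for the alpha with minimum RSS."""
    best = None
    for alpha in grid:
        Z = koyck_design(x, alpha)
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = float(np.sum((y - Z @ coef) ** 2))
        if best is None or rss < best[0]:
            best = (rss, float(alpha), coef)
    return best


# Simulated data with mu = 1, delta = 2, alpha = 0.6 (hypothetical values).
rng = np.random.default_rng(0)
n, burn = 200, 200
x_all = rng.normal(size=burn + n)
w = np.zeros(burn + n)
for t in range(1, burn + n):
    w[t] = x_all[t] + 0.6 * w[t - 1]                # geometric distributed lag of X
y_all = 1.0 + 2.0 * w + 0.1 * rng.normal(size=burn + n)
x, y = x_all[burn:], y_all[burn:]

rss, alpha_hat, (mu_hat, delta_hat, gamma_hat) = koyck_grid_search(y, x, np.arange(0.01, 1.0, 0.01))
print(alpha_hat, mu_hat, delta_hat)
```

As the text notes, the OLS standard errors printed by a regression program at the chosen $\alpha$ treat $\alpha$ as known; they are not computed here.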


Estimation with a Lagged Dependent Variable 
Instead of estimating the simple Koyck scheme directly, as above, one might use
the derived relation

$$Y_t = \mu(1 - \alpha) + \alpha Y_{t-1} + \delta_0 X_t + (u_t - \alpha u_{t-1})$$

already established in Eq. (9-11). The adaptive expectations scheme is formally
identical, as shown in Eq. (9-18). The partial adjustment model, derived in Eq.
(9-22), gives

$$Y_t = \alpha(1 - \lambda) + \lambda Y_{t-1} + \beta(1 - \lambda)X_t + u_t$$

Both relations incorporate a lagged Y among the regressors and differ only with 
respect to the properties of the disturbance term. We must now examine the 


estimation problems occasioned by the lagged Y value, and we shall do so under 
various assumptions about the disturbance term. 


Lagged Dependent Variable and Well-Behaved Disturbances 
Consider the relation

$$Y_t = \beta_1 + \beta_2 X_t + \beta_3 Y_{t-1} + u_t \tag{9-47}$$

where we assume the u's to be independently and identically distributed with zero
mean and variance $\sigma_u^2$. The relation (9-47) may be rewritten to show the
dependence of $Y_t$ on the stream of current and previous values of X and u, that is,

$$(1 - \beta_3 L)Y_t = \beta_1 + \beta_2 X_t + u_t$$

giving

$$Y_t = \alpha + \beta_2(X_t + \beta_3 X_{t-1} + \beta_3^2 X_{t-2} + \cdots) + v_t \tag{9-48}$$

where

$$\alpha = \frac{\beta_1}{1 - \beta_3} \qquad\text{and}\qquad v_t = (1 - \beta_3 L)^{-1}u_t$$

If X were held constant at some level $\bar{X}$ and $\bar{Y}$ denotes the corresponding level of
Y, then

$$E(\bar{Y}) = \frac{\beta_1}{1 - \beta_3} + \frac{\beta_2}{1 - \beta_3}\bar{X}$$

provided $|\beta_3| < 1$. If $|\beta_3| > 1$, E(Y) would explode. In practice a $\{Y_t\}$ series may
have explosive tendencies, which are held in check by various “floors” and/or
“ceilings.” A model of such a process would be highly nonlinear, and the
statistical treatment of such models is still in its infancy. We therefore impose the
constraint

$$|\beta_3| < 1 \tag{9-49}$$

It is also clear from Eq. (9-48) that expressions such as $(\sum Y_{t-1}^2/n)$ and $(\sum X_t Y_{t-1}/n)$
will involve linear combinations of quantities such as

$$\frac{1}{n}\sum X_t^2 \qquad\text{and}\qquad \frac{1}{n}\sum X_t X_{t-1},\;\; \frac{1}{n}\sum X_t X_{t-2},\;\ldots$$

The additional assumption is then made that the $X_t$ are bounded and that the
above quantities have finite limits as n tends to infinity.
The model of Eq. (9-47) may be written in matrix form as

$$y = Z\beta + u \tag{9-50}$$

where

$$
Z = \begin{bmatrix}
1 & X_1 & Y_0 \\
1 & X_2 & Y_1 \\
\vdots & \vdots & \vdots \\
1 & X_n & Y_{n-1}
\end{bmatrix}
$$

To make Eq. (9-50) operational $Y_0$ has to be known.† The Z matrix is stochastic
since the Y's are stochastic, even though the X's may be assumed to be exogenous
and nonstochastic. However, the case is not an exact parallel of the stochastic
data matrix considered in Sec. 7-3. There the strong assumption of full independence
between the disturbance and the explanatory variables was valid. In this
model it is clear from Eq. (9-47) that, while $u_t$ is independent of $X_s$ for all t and all
s and also independent of $Y_{t-s}$ for positive s, it is not independent of $Y_t$, and since
$Y_t$ in turn influences $Y_{t+1}$, $u_t$ is not independent of $Y_{t+1}, Y_{t+2}, \ldots$. This apparently
small difference has an important effect on the estimates of Eq. (9-50).
The underpinning assumptions for Eq. (9-50) may now be stated:

1. $E(u) = 0$ and $E(uu') = \sigma_u^2 I$
2. $E(X_t u_t) = E(Y_{t-1}u_t) = 0$ for all t
3. $\operatorname{plim}\left(\frac{1}{n}Z'Z\right) = \Sigma_{zz}$, a symmetric positive definite matrix

Assumption 3 follows from the stability assumption on $\beta_3$ and the assumption
about limiting values for the second-order moments of X.‡

†If it is not, the effective sample size is n − 1, and the statistical inference procedures are
conditional on $Y_1$ with n − 1 observations, rather than conditional on $Y_0$ with n observations.
Asymptotically, of course, it makes no difference.
‡ For a complete derivation see E. Malinvaud, Statistical Methods of Econometrics, 2d edition,
North Holland, Amsterdam, 1970, pp. 540 ff.

The Mann-Wald theorem can then be applied to give the results†

$$\operatorname{plim}\left(\frac{1}{n}Z'u\right) = 0 \tag{9-51}$$

and

$$\frac{1}{\sqrt{n}}Z'u \xrightarrow{d} N(0, \sigma_u^2\Sigma_{zz}) \tag{9-52}$$

The OLS estimator of $\beta$ in Eq. (9-50) is

$$\hat\beta = (Z'Z)^{-1}Z'y = \beta + (Z'Z)^{-1}Z'u$$

Thus

$$\sqrt{n}(\hat\beta - \beta) = \left(\frac{1}{n}Z'Z\right)^{-1}\frac{1}{\sqrt{n}}Z'u \tag{9-53}$$

Using Eqs. (9-51), (9-52), and (7-24) gives

$$\sqrt{n}(\hat\beta - \beta) \sim AN(0, \sigma_u^2\Sigma_{zz}^{-1})$$

or

$$\hat\beta \sim AN\left(\beta, \frac{\sigma_u^2}{n}\Sigma_{zz}^{-1}\right) \tag{9-54}$$

Thus even without the assumption of normality for the u's the OLS estimators
will be consistent and asymptotically normally distributed. The unknown variance
matrix in Eq. (9-54) can be consistently estimated by the usual formula $s^2(Z'Z)^{-1}$.
If, in addition, the u's are normally distributed, the estimators are also ML and
efficient. These results extend simply to the general case of various lagged Y
values and several X's. Thus there is substantial justification for the continued
use of OLS in relationships containing lagged dependent variables, provided the
disturbance term is serially independent. The estimators will, however, be subject
to finite sample bias, and one should also recall the problems of testing for
autocorrelated disturbances in this case.‡


Lagged Dependent Variable and Autocorrelated Disturbances 
Suppose now that we repeat the relation (9-47)

$$Y_t = \beta_1 + \beta_2 X_t + \beta_3 Y_{t-1} + u_t$$

as before, but the u's are now assumed to follow an AR(1) scheme§

$$u_t = \rho u_{t-1} + \varepsilon_t \qquad |\rho| < 1 \tag{9-55}$$

where $E(\varepsilon) = 0$ and $E(\varepsilon\varepsilon') = \sigma_\varepsilon^2 I$.

†H. B. Mann and A. Wald, “On the Statistical Treatment of Linear Stochastic Difference
Equations,” Econometrica, vol. 11, 1943, pp. 173-220, especially pp. 185-190.

‡ See Sec. 8-5.

§ Note that this is different from the error structure in Eq. (9-11) associated with the Koyck lag;
the latter [an MA(1) error] is considered below.

This new assumption has an important effect. From Eq. (9-55) it is seen that $u_{t-1}$
influences $u_t$, but from Eq. (9-47), in period t − 1, $u_{t-1}$ influences $Y_{t-1}$. This sets
up a dependence between $u_t$ and $Y_{t-1}$ in Eq. (9-47), that is,

$$E(Y_{t-1}u_t) \neq 0$$

From Eqs. (9-47) and (9-55) it follows that†

$$\operatorname{plim}\left(\frac{1}{n}\sum Y_{t-1}u_t\right) = \frac{\rho\sigma_u^2}{1 - \beta_3\rho} \tag{9-56}$$
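The consequence is that the application of OLS to Eq. (9-47) will yield inconsistent
estimates of all parameters. This is so because

$$\operatorname{plim}(\hat\beta) = \beta + \Sigma_{zz}^{-1}\operatorname{plim}\left(\frac{1}{n}Z'u\right)$$

and

$$
\operatorname{plim}\left(\frac{1}{n}Z'u\right)
= \begin{bmatrix}
\operatorname{plim}\left(\frac{1}{n}\sum u_t\right) \\
\operatorname{plim}\left(\frac{1}{n}\sum X_t u_t\right) \\
\operatorname{plim}\left(\frac{1}{n}\sum Y_{t-1}u_t\right)
\end{bmatrix}
= \begin{bmatrix}
0 \\ 0 \\ \dfrac{\rho\sigma_u^2}{1 - \beta_3\rho}
\end{bmatrix}
$$

It only takes one nonzero element in $\operatorname{plim}((1/n)Z'u)$ in general to render all
elements in $\hat\beta$ inconsistent. There are two main methods of obtaining consistent
estimators in this model, namely, instrumental variables and ML.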


Instrumental Variables 
Consider

$$y = Z\beta + u$$

Premultiply by Z' to give

$$Z'y = Z'Z\beta + Z'u \tag{9-57}$$

The OLS estimator b of Chap. 5 may be obtained from this equation simply
by setting Z'u = 0, giving

$$Z'y = Z'Zb \tag{9-58}$$

On the assumption that

$$\operatorname{plim}\left(\frac{1}{n}Z'Z\right) = \Sigma_{zz} \qquad\text{and}\qquad \operatorname{plim}\left(\frac{1}{n}Z'u\right) = 0$$

we can divide Eqs. (9-57) and (9-58) by n, take probability limits, and equate the
right-hand sides to find

$$\Sigma_{zz}\operatorname{plim}(b) = \Sigma_{zz}\beta$$

so that

$$\operatorname{plim}(b) = \beta$$

which is the standard result on the consistency of the OLS estimator.

† See Problem 9-5.

In the present model the assumption that $\operatorname{plim}((1/n)Z'u)$ is the zero vector
cannot be sustained. Suppose, however, that one can find an $n \times k$ matrix W
containing variables which are thought to be contemporaneously uncorrelated
with the disturbance term. That is, we assume

$$E(W_{it}u_t) = 0 \qquad i = 1, \ldots, k;\;\; t = 1, \ldots, n \tag{9-59}$$


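Premultiplying the model by W' and setting W'u to the zero vector, by analogy
with the OLS procedure, gives the instrumental variable (IV) estimator $b_{IV}$,

$$W'y = (W'Z)b_{IV}$$

which, on the assumption that W'Z is nonsingular, may be written

$$b_{IV} = (W'Z)^{-1}W'y \tag{9-60}$$

On the further assumptions that

$$\operatorname{plim}\left(\frac{1}{n}W'Z\right) = \Sigma_{wz} \quad\text{a nonsingular matrix} \tag{9-61}$$

and

$$\operatorname{plim}\left(\frac{1}{n}W'u\right) = 0 \tag{9-62}$$

it is easy to see that

$$\operatorname{plim}(b_{IV}) = \beta + \left[\operatorname{plim}\left(\frac{1}{n}W'Z\right)\right]^{-1}\operatorname{plim}\left(\frac{1}{n}W'u\right) = \beta$$

so that the IV estimator would be consistent.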

The variables in W are referred to as instruments. Some of them may simply 
be variables from the original Z matrix. In the present model there is no need to 
replace X, since it is already assumed to be independent of the disturbance term. 
In addition to being uncorrelated with the disturbance term, the instruments 
should not be totally uncorrelated with the explanatory variables since W’Z 
would then be a null matrix and the estimating technique would break down. If, 
in fact, W’Z is “nearly” null, the IV technique will give very poor results. 

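In Eq. (9-47) we need just one instrument, and it is customary to select $X_{t-1}$
as the instrument for $Y_{t-1}$. The appropriate matrices are then

$$
W = \begin{bmatrix}
1 & X_1 & X_0 \\
1 & X_2 & X_1 \\
\vdots & \vdots & \vdots \\
1 & X_n & X_{n-1}
\end{bmatrix}
\qquad
Z = \begin{bmatrix}
1 & X_1 & Y_0 \\
1 & X_2 & Y_1 \\
\vdots & \vdots & \vdots \\
1 & X_n & Y_{n-1}
\end{bmatrix}
$$

and the IV estimator is

$$
b_{IV} = \begin{bmatrix}
n & \sum X_t & \sum Y_{t-1} \\
\sum X_t & \sum X_t^2 & \sum X_t Y_{t-1} \\
\sum X_{t-1} & \sum X_t X_{t-1} & \sum X_{t-1}Y_{t-1}
\end{bmatrix}^{-1}
\begin{bmatrix}
\sum Y_t \\ \sum X_t Y_t \\ \sum X_{t-1}Y_t
\end{bmatrix}
$$

where all summations run from t = 1 to t = n.†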
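
The contrast between OLS and IV in this model can be illustrated by simulation. The sketch below is illustrative only: the function names `simulate` and `ols_and_iv` and all parameter values are hypothetical, X is drawn as an independent series, and a large sample is used so that the estimates sit close to their probability limits.

```python
import numpy as np


def simulate(n, beta, rho, rng):
    """Generate X_0..X_n and Y_0..Y_n from Eq. (9-47) with AR(1) disturbances."""
    b1, b2, b3 = beta
    x = rng.normal(scale=2.0, size=n + 1)
    eps = rng.normal(size=n + 1)
    u = np.zeros(n + 1)
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        u[t] = rho * u[t - 1] + eps[t]
        y[t] = b1 + b2 * x[t] + b3 * y[t - 1] + u[t]
    return x, y


def ols_and_iv(x, y):
    """OLS and IV estimates of (beta_1, beta_2, beta_3), X_{t-1} instrumenting Y_{t-1}."""
    ones = np.ones(len(y) - 1)
    Z = np.column_stack([ones, x[1:], y[:-1]])      # rows (1, X_t, Y_{t-1})
    W = np.column_stack([ones, x[1:], x[:-1]])      # rows (1, X_t, X_{t-1})
    target = y[1:]
    b_ols = np.linalg.solve(Z.T @ Z, Z.T @ target)
    b_iv = np.linalg.solve(W.T @ Z, W.T @ target)   # Eq. (9-60)
    return b_ols, b_iv


rng = np.random.default_rng(0)
x, y = simulate(n=20000, beta=(1.0, 1.0, 0.5), rho=0.7, rng=rng)
b_ols, b_iv = ols_and_iv(x, y)
print("OLS:", np.round(b_ols, 3))
print("IV: ", np.round(b_iv, 3))
```

With positive $\rho$ the OLS coefficient on $Y_{t-1}$ should settle well above the true value of 0.5, in line with Eq. (9-56), while the IV coefficient should lie close to it.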

IV estimators are, in general, biased in finite samples and their variances
difficult to establish.‡ It is, however, possible to derive a fairly simple and
important asymptotic result. The result requires three assumptions. The first is
Eq. (9-59),

$$E(W_{it}u_t) = 0 \qquad\text{for all } i, t$$

that is, that the instruments are contemporaneously uncorrelated with the
disturbances. The second is that the instruments possess finite probability limits for
all second-order moments, that is,

$$\operatorname{plim}\left(\frac{1}{n}W'W\right) = \Sigma_{ww}$$

a symmetric positive definite matrix. The third is that

$$E(u) = 0 \qquad\text{and}\qquad E(uu') = \sigma_u^2 I$$

Because of Eq. (9-55) this last assumption is not true for model (9-47). However,
we will ignore this complication for the moment. Under the above three assumptions
the Mann-Wald theorem applies so that

$$\operatorname{plim}\left(\frac{1}{n}W'u\right) = 0$$

and

$$\frac{1}{\sqrt{n}}W'u \xrightarrow{d} N(0, \sigma_u^2\Sigma_{ww})$$

The IV estimator of Eq. (9-60) is

$$b_{IV} = \beta + (W'Z)^{-1}W'u$$

Thus

$$\sqrt{n}(b_{IV} - \beta) = \left(\frac{1}{n}W'Z\right)^{-1}\left(\frac{1}{\sqrt{n}}W'u\right)$$

† Should the values $X_0$ and $Y_0$ not be available, the first row is dropped from W and Z, the
summations run from t = 2 to t = n, and n is replaced by n − 1 in the formula for $b_{IV}$.

‡ Contrast the assertion by P. J. Dhrymes, Econometrics: Statistical Foundations and Applications,
Harper and Row, New York, 1970, p. 297: “All IV estimators, no matter what the choice of
instruments, are unbiased and consistent.” This statement comes after a passage in which the only
explicit assumptions relate to probability limits. The IV estimators are consistent. A possible
explanation of the incorrect assertion about unbiasedness is given in App. A-8, Expectations in
Bivariate Distributions, where the matter is discussed in detail. The same type of error can also affect
the derivation of results about finite sample variance matrices, as in formula (6-4-12) of Dhrymes.

Recalling assumption (9-61) that

$$\operatorname{plim}\left(\frac{1}{n}W'Z\right) = \Sigma_{wz}$$

a nonsingular matrix, an application of Eq. (7-24) gives

$$\sqrt{n}(b_{IV} - \beta) \sim AN(0, \sigma_u^2\Sigma_{wz}^{-1}\Sigma_{ww}\Sigma_{zw}^{-1})$$

or

$$b_{IV} \sim AN\left(\beta, \frac{\sigma_u^2}{n}\Sigma_{wz}^{-1}\Sigma_{ww}\Sigma_{zw}^{-1}\right) \tag{9-63}$$

Under the full assumptions the IV estimator would be a consistent and asymptotically
normal estimator of $\beta$. The variance matrix would be estimated by the
formula

$$\operatorname{est\,var}(b_{IV}) = s^2(W'Z)^{-1}(W'W)(Z'W)^{-1} \tag{9-64}$$

where

$$s^2 = \frac{(y - Zb_{IV})'(y - Zb_{IV})}{n - k}$$
n-—k 
The statement in Eq. (9-63) is not strictly valid for the model (9-47) since the 
u’s are not independently distributed in consequence of Eq. (9-55). It also follows 
that Eq. (9-64) would not be the appropriate formula for estimating the sampling 
variances of the IV estimators, though it is often applied for want of anything 
better. Results (9-63) and (9-64) hold for the IV estimator under the full set of 
three assumptions outlined above, and we shall have need of them subsequently. 
The main use of the IV estimator in this model is to provide a consistent 
estimator as a starting point in an iterative ML technique. 


Maximum-Likelihood Estimator 
Combining Eq. (9-47) with the AR(1) disturbance process in Eq. (9-55) gives

$$Y_t = \beta_1 - \beta_1\rho + \beta_2 X_t - \beta_2\rho X_{t-1} + (\beta_3 + \rho)Y_{t-1} - \beta_3\rho Y_{t-2} + \varepsilon_t \tag{9-65}$$

If one assumes

$$\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$$

then ML estimators of the $\beta$'s and $\rho$ would be given by the values minimizing
$\sum\varepsilon_t^2$. However, the first-order conditions would not yield linear equations in the
estimators, since there are five variables in Eq. (9-65) but only four parameters to
be estimated. An iterative Cochrane-Orcutt procedure may be used, based on two
alternative ways of rewriting Eq. (9-65), namely,

$$(Y_t - \rho Y_{t-1}) = \beta_1(1 - \rho) + \beta_2(X_t - \rho X_{t-1}) + \beta_3(Y_{t-1} - \rho Y_{t-2}) + \varepsilon_t \tag{9-66a}$$

$$(Y_t - \beta_1 - \beta_2 X_t - \beta_3 Y_{t-1}) = \rho(Y_{t-1} - \beta_1 - \beta_2 X_{t-1} - \beta_3 Y_{t-2}) + \varepsilon_t \tag{9-66b}$$

Given a starting value for $\rho$, the transformed variables in Eq. (9-66a) could be
computed and OLS applied to yield estimates of the $\beta$'s. These estimates in turn
could be used to compute the transformed variables in Eq. (9-66b) and OLS
applied to produce a revised estimate of $\rho$ with the iterations continuing till
convergence.

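Setting up the log likelihood for Eq. (9-65) and differentiating gives the
information matrix†

$$
R\begin{pmatrix}\beta_1 \\ \beta_2 \\ \beta_3 \\ \rho \\ \sigma_\varepsilon^2\end{pmatrix}
= \frac{1}{\sigma_\varepsilon^2}
\begin{bmatrix}
n(1-\rho)^2 & (1-\rho)\sum X_t^* & (1-\rho)\sum Y_{t-1}^* & 0 & 0 \\
 & \sum X_t^{*2} & \sum X_t^* Y_{t-1}^* & 0 & 0 \\
 & & \sum Y_{t-1}^{*2} & \dfrac{n\sigma_\varepsilon^2}{1 - \beta_3\rho} & 0 \\
 & & & \dfrac{n\sigma_\varepsilon^2}{1 - \rho^2} & 0 \\
 & & & & \dfrac{n}{2\sigma_\varepsilon^2}
\end{bmatrix}
\tag{9-67}
$$

where

$$X_t^* = X_t - \rho X_{t-1} \qquad\text{and}\qquad Y_{t-1}^* = Y_{t-1} - \rho Y_{t-2}$$

This matrix is symmetric, and we have just shown the upper triangular portion.
As usual it involves the unknown parameters, but it would be calculated using the
estimated parameters.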

It has recently been shown that the iterative Cochrane-Orcutt process may
lead to inconsistent estimates. The basic point is that, while for a finite sample
the Cochrane-Orcutt estimators will always converge to some fixed point, that
point may correspond to a local minimum of the sum of squares rather than the
global minimum, and the probability limit of the fixed point will not be the true
parameter vector. To ensure consistency of the iterative process, one must start
with a consistent estimator. Thus starting the process by setting $\rho$ to zero in Eq.
(9-66a) would be inappropriate since that corresponds to estimating the $\beta$'s by
applying OLS directly to Eq. (9-47), which is an inconsistent procedure. On the
other hand, the process could be started consistently by computing, say, the IV
estimators of the $\beta$'s as described earlier.‡

† See Problem 9-6.
‡ R. Betancourt and H. Kelejian, “Lagged Endogenous Variables and the Cochrane-Orcutt
Procedure,” Econometrica, vol. 49, 1981, pp. 1073-1078.

An alternative approach to minimizing $\sum\varepsilon_t^2$ in Eq. (9-65) is to use a grid
search over the permissible range of $\rho$ values. Thus a set of $\rho$ values is specified in
the interval (−1, 1). Each value is used to compute the quasi first differences in
Eq. (9-66a), and OLS is then applied to minimize $\sum\varepsilon_t^2$. A fine enough grid should
distinguish the global minimum from any local minima. If necessary a finer grid
may be imposed around the $\rho$ value chosen in the first grid search and a second
grid search applied to obtain a finer estimate of the minimizing $\rho$ value. This
value and the corresponding $\beta$'s obtained from Eq. (9-66a) constitute the point
estimates, and the asymptotic standard errors can be obtained from Eq. (9-67).
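
A grid search of this kind can be sketched as follows. The sketch is illustrative only: the function names `simulate_ar1_model` and `rho_grid_search`, the parameter values, and the grid spacing of 0.01 are hypothetical choices, and the standard errors from Eq. (9-67) are not computed.

```python
import numpy as np


def simulate_ar1_model(n, beta, rho, rng):
    """Generate observations t = 0..n from Eq. (9-47) with AR(1) disturbances."""
    b1, b2, b3 = beta
    x = rng.normal(scale=2.0, size=n + 1)
    eps = rng.normal(size=n + 1)
    u = np.zeros(n + 1)
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        u[t] = rho * u[t - 1] + eps[t]
        y[t] = b1 + b2 * x[t] + b3 * y[t - 1] + u[t]
    return x, y


def rho_grid_search(x, y, grid):
    """For each rho, apply OLS to the quasi first differences of Eq. (9-66a)."""
    best = None
    for rho in grid:
        y_dep = y[2:] - rho * y[1:-1]                     # Y_t - rho Y_{t-1}
        x_star = x[2:] - rho * x[1:-1]                    # X_t - rho X_{t-1}
        y_lag = y[1:-1] - rho * y[:-2]                    # Y_{t-1} - rho Y_{t-2}
        const = np.full(len(y_dep), 1.0 - rho)            # its coefficient is beta_1
        A = np.column_stack([const, x_star, y_lag])
        coef, *_ = np.linalg.lstsq(A, y_dep, rcond=None)
        rss = float(np.sum((y_dep - A @ coef) ** 2))
        if best is None or rss < best[0]:
            best = (rss, float(rho), coef)
    return best


rng = np.random.default_rng(0)
x, y = simulate_ar1_model(n=20000, beta=(1.0, 1.0, 0.5), rho=0.7, rng=rng)
rss, rho_hat, beta_hat = rho_grid_search(x, y, np.linspace(-0.99, 0.99, 199))
print(rho_hat, np.round(beta_hat, 3))
```

Because every grid point is evaluated, the search is not trapped by a local minimum in the way the iterative process can be; the cost is one regression per grid point.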


MA(1) Disturbance 


Instead of the AR(1) disturbance process assumed in Eq. (9-55), let us now
consider an MA(1) process. As has been shown, this is likely to occur in a simple
Koyck scheme or in an adaptive expectations model. In each of these cases there
is the further significant feature that the parameter of the MA(1) process is also
the coefficient of the lagged dependent variable. The model to be considered is
thus

$$Y_t = \alpha + \lambda Y_{t-1} + \beta X_t + (u_t - \lambda u_{t-1}) \qquad |\lambda| < 1 \tag{9-68}$$

where it is assumed that

$$u \sim N(0, \sigma_u^2 I)$$

Utilizing the existence of the common parameter, this relation may be rewritten
as

$$Z_t = \alpha + \lambda Z_{t-1} + \beta X_t \tag{9-69}$$

where

$$Z_t = Y_t - u_t$$

Successive substitution for the Z variable in Eq. (9-69) gives

$$
\begin{aligned}
Z_t = {} & \alpha(1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1}) \\
& + \beta(X_t + \lambda X_{t-1} + \lambda^2 X_{t-2} + \cdots + \lambda^{t-1}X_1) + Z_0\lambda^t
\end{aligned}
$$

or

$$Y_t = \alpha(1 + \lambda + \cdots + \lambda^{t-1}) + \beta X_t^* + Z_0\lambda^t + u_t \tag{9-70}$$

where now

$$X_t^* = X_t + \lambda X_{t-1} + \lambda^2 X_{t-2} + \cdots + \lambda^{t-1}X_1$$

which may be computed recursively, for any given $\lambda$, as

$$X_t^* = X_t + \lambda X_{t-1}^* \qquad\text{with}\qquad X_1^* = X_1$$


Relation (9-70) has a well-behaved disturbance term suitable for ML (or equiva- 
lently OLS) estimation, with Z, treated as a nuisance parameter. The data matrix 
for OLS estimation would be 


1 xed 

2 

X(A) = 1+, PvGr 
1+A4+¥ 


The appropriate procedure is then a grid search over the interval 0 < A < 1. For 
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each value of A, X(A) is computed, OLS applied to Eq. (9-70), and the set of 
parameters is chosen which minimizes the residual sum of squares. 
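
As a rough illustration of this procedure, the following Python/NumPy sketch builds the data matrix $X(\lambda)$ row by row and picks the grid value of $\lambda$ with the smallest residual sum of squares. The function names `ma1_design` and `ma1_grid_search` are hypothetical, and the arrays are assumed to be indexed so that element 0 holds the observation for $t = 1$.

```python
import numpy as np

def ma1_design(x, lam):
    """Data matrix with columns W_t, X*_t, lam**t for t = 1..n, as in Eq. (9-70)."""
    n = x.size
    t = np.arange(1, n + 1)
    w = np.cumsum(lam ** (t - 1))          # W_t = 1 + lam + ... + lam**(t-1)
    x_star = np.empty(n)
    x_star[0] = x[0]                       # X*_1 = X_1
    for i in range(1, n):
        x_star[i] = x[i] + lam * x_star[i - 1]   # X*_t = X_t + lam * X*_{t-1}
    return np.column_stack([w, x_star, lam ** t])

def ma1_grid_search(y, x, grid):
    """Return (RSS, lam, coef) at the grid minimum; coef holds (alpha, beta, Z_0)."""
    best = None
    for lam in grid:
        Z = ma1_design(x, lam)
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        rss = float(resid @ resid)
        if best is None or rss < best[0]:
            best = (rss, float(lam), coef)
    return best
```

Because the column $\lambda^t$ dies away quickly, $Z_0$ is estimated almost entirely from the first few observations, which is one reason for treating it as a nuisance parameter.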

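The asymptotic standard errors may be obtained from the information matrix
in the usual way. The log likelihood for Eq. (9-70) may be written

$$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma_u^2 - \frac{1}{2\sigma_u^2}\sum u_t^2 \tag{9-71}$$

where

$$u_t = Y_t - \alpha W_t - \beta X_t^* - Z_0\lambda^t$$

and

$$W_t = 1 + \lambda + \cdots + \lambda^{t-1}$$

The unknown parameters in Eq. (9-71) are $\alpha$, $\beta$, $\lambda$, $Z_0$, and $\sigma_u^2$. It may be shown
that the expected values of the cross second-order partial derivatives involving $\sigma_u^2$
are all zero. Thus inferences about $\sigma_u^2$ may be made independently of the other
parameters. The ML estimator is

$$\hat\sigma_u^2 = \frac{\sum \hat u_t^2}{n}$$

with asymptotic variance $2\sigma_u^4/n$. The information matrix for the remaining four
parameters, taken in the order $\alpha$, $\beta$, $\lambda$, $Z_0$, is†

$$
\frac{1}{\sigma_u^2}
\begin{bmatrix}
\sum W_t^2 & \sum W_t X_t^* & \sum W_t P_t & \sum W_t \lambda^t \\
 & \sum X_t^{*2} & \sum X_t^* P_t & \sum X_t^* \lambda^t \\
 & & \sum P_t^2 & \sum P_t \lambda^t \\
 & & & \sum \lambda^{2t}
\end{bmatrix}
\tag{9-72}
$$

where the matrix is symmetric, $W_t$ and $X_t^*$ have already been defined, and

$$
P_t = -\frac{\partial u_t}{\partial \lambda}
= \alpha[1 + 2\lambda + \cdots + (t-1)\lambda^{t-2}]
+ \beta[X_{t-1} + 2\lambda X_{t-2} + \cdots + (t-1)\lambda^{t-2} X_1]
+ tZ_0\lambda^{t-1}
$$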


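For a penultimate problem we return to Eq. (9-27), which represents a
combination of adaptive expectations and partial adjustment. The equation is

$$Y_t = \alpha(1-\lambda_1)(1-\lambda_2) + (\lambda_1+\lambda_2)Y_{t-1} - \lambda_1\lambda_2 Y_{t-2} + \beta(1-\lambda_1)(1-\lambda_2)X_t + (u_t - \lambda_2 u_{t-1})$$

Defining $Z_t = Y_t - u_t$, this may be rewritten as

$$Z_t = \alpha_0 + \lambda_1 Y_t^* + \beta_0 X_t + \lambda_2 Z_{t-1} \tag{9-73}$$

where

$$\alpha_0 = \alpha(1-\lambda_1)(1-\lambda_2)$$
$$\beta_0 = \beta(1-\lambda_1)(1-\lambda_2)$$
$$Y_t^* = Y_{t-1} - \lambda_2 Y_{t-2}$$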


+ See Problem 9-7. 




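Successive substitution for $Z$ and transformation back to $Y$ gives

$$
\begin{aligned}
Y_t = {} & \alpha_0[1 + \lambda_2 + \cdots + \lambda_2^{t-1}] \\
& + \lambda_1[Y_t^* + \lambda_2 Y_{t-1}^* + \cdots + \lambda_2^{t-1} Y_1^*] \\
& + \beta_0[X_t + \lambda_2 X_{t-1} + \cdots + \lambda_2^{t-1} X_1] + \lambda_2^t Z_0 + u_t
\end{aligned}
\tag{9-74}
$$

The disturbance term in Eq. (9-74) is well-behaved. The "variables" in square
brackets are all dependent on $\lambda_2$. Thus a grid search over $0 < \lambda_2 < 1$ and the
choice of the error minimizing version of Eq. (9-74) will yield point estimates of
all the parameters.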

Finally we take a look at the estimation problems of Eq. (9-14) where it was
assumed that $Y_t$ responded to two separate Koyck lags with different parameters.
The equation was

$$Y_t = \mu^* + (\alpha_1+\alpha_2)Y_{t-1} - \alpha_1\alpha_2 Y_{t-2} + \beta X_t - \alpha_2\beta X_{t-1} + \gamma Z_t - \alpha_1\gamma Z_{t-1} + v_t$$

with

$$v_t = u_t - (\alpha_1+\alpha_2)u_{t-1} + \alpha_1\alpha_2 u_{t-2}$$

The disturbance series $\{v_t\}$ follows an MA(2) process. Ignoring this complication
for the moment and assuming the $v$'s to be independently and identically
distributed normal variables, the application of unrestricted OLS to Eq. (9-14)
would not yield the ML estimators since the seven coefficients are functions of
only five parameters. However, the relation may be rewritten as

$$Y_t^* = \mu^* + \beta X_t^* + \gamma Z_t^* + v_t \tag{9-75}$$

where

$$Y_t^* = Y_t - (\alpha_1+\alpha_2)Y_{t-1} + \alpha_1\alpha_2 Y_{t-2}$$
$$X_t^* = X_t - \alpha_2 X_{t-1}$$
$$Z_t^* = Z_t - \alpha_1 Z_{t-1}$$

The transformed variables in Eq. (9-75) depend on the $\alpha_1$, $\alpha_2$ parameters. Given
any pair of $\alpha_1$, $\alpha_2$ values and assuming the $v$'s to be independently distributed,
OLS could then be applied to Eq. (9-75) to yield estimates of $\mu^*$, $\beta$, $\gamma$, and the
residual sum of squares. The indicated estimation procedure would be a two-dimensional
grid search over $\alpha_1$, $\alpha_2$ pairs, each parameter being constrained to the
(0, 1) interval.
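
A minimal Python/NumPy sketch of such a two-dimensional search is given below. It assumes that the lag on $X$ carries the parameter $\alpha_1$ and the lag on $Z$ carries $\alpha_2$ in the underlying distributed-lag form, so that the transformed regressors are $X_t - \alpha_2 X_{t-1}$ and $Z_t - \alpha_1 Z_{t-1}$; if the roles are reversed the two parameters simply swap. The function names `transformed_ols` and `grid_search_2d` are hypothetical.

```python
import numpy as np

def transformed_ols(y, x, z, a1, a2):
    """OLS on the transformed variables of Eq. (9-75) for one (a1, a2) pair."""
    y_star = y[2:] - (a1 + a2) * y[1:-1] + a1 * a2 * y[:-2]
    x_star = x[2:] - a2 * x[1:-1]
    z_star = z[2:] - a1 * z[1:-1]
    D = np.column_stack([np.ones(y_star.size), x_star, z_star])
    coef, *_ = np.linalg.lstsq(D, y_star, rcond=None)
    resid = y_star - D @ coef
    return float(resid @ resid), coef      # coef holds (mu*, beta, gamma)

def grid_search_2d(y, x, z, grid):
    """Search every (a1, a2) pair on the grid; return (RSS, a1, a2, coef) at the minimum."""
    best = (np.inf, None, None, None)
    for a1 in grid:
        for a2 in grid:
            rss, coef = transformed_ols(y, x, z, a1, a2)
            if rss < best[0]:
                best = (rss, float(a1), float(a2), coef)
    return best
```

The cost grows with the square of the number of grid points, so a coarse first pass followed by a finer grid around the best pair is usually preferable to one very fine grid.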

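Alternatively, if one makes the explicit assumption that the $v_t$ follow an
MA(2) process and if $u \sim N(0, \sigma_u^2 I)$, the variance matrix for the $v$'s is given by

$$
E(vv') = \sigma_u^2\Omega = \sigma_u^2
\begin{bmatrix}
\delta_0 & \delta_1 & \delta_2 & 0 & 0 & \cdots & 0 \\
\delta_1 & \delta_0 & \delta_1 & \delta_2 & 0 & \cdots & 0 \\
\delta_2 & \delta_1 & \delta_0 & \delta_1 & \delta_2 & \cdots & 0 \\
\vdots & & & \ddots & & & \vdots \\
0 & 0 & 0 & \cdots & \delta_2 & \delta_1 & \delta_0
\end{bmatrix}
\tag{9-76}
$$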

Table 9-1 Lagged variable models

| Model | Assumption | Estimators |
|---|---|---|
| 1. $Y_t = \mu + D(L)X_t + u_t$ | $D(L)$ a polynomial in the lag operator; $\{u_t\}$ white noise† | OLS (ML), possibly plagued with imprecision due to collinearity |
| 2. $Y_t = \mu + D(L)X_t + u_t$ | Almon approximation to $D(L)$ | OLS (ML) |
| 3. $Y_t = \mu + D(L)X_t + u_t$ | Koyck approximation to $D(L)$ | ML (grid search) |
| 4. $Y_t = \beta_1 + \beta_2X_t + \beta_3Y_{t-1} + u_t$ | $\{u_t\}$ white noise | OLS asymptotically normal and efficient |
| 5. $Y_t = \beta_1 + \beta_2X_t + \beta_3Y_{t-1} + u_t$ | $\{u_t\}$ follows AR(1) process | OLS now inconsistent; consistent estimators via IV or ML |
| 6. $Y_t = \alpha + \lambda Y_{t-1} + \beta X_t + (u_t - \lambda u_{t-1})$ | $\{u_t\}$ white noise | ML (grid search) |
| 7. $Y_t = \alpha(1-\lambda_1)(1-\lambda_2) + (\lambda_1+\lambda_2)Y_{t-1} - \lambda_1\lambda_2Y_{t-2} + \beta(1-\lambda_1)(1-\lambda_2)X_t + (u_t - \lambda_2u_{t-1})$ | Combination of adaptive expectations and partial adjustment | ML (grid search) |
| 8. $Y_t = \mu^* + (\alpha_1+\alpha_2)Y_{t-1} - \alpha_1\alpha_2Y_{t-2} + \beta X_t - \alpha_2\beta X_{t-1} + \gamma Z_t - \alpha_1\gamma Z_{t-1} + v_t$ | Two explanatory variables with separate Koyck lags | ML (grid search) or GLS depending on treatment of $\{v_t\}$ |

† If the $u_t$ are independently and identically distributed as $N(0, \sigma^2)$, then $\{u_t\}$ is said to be a white noise series.


where

$$\delta_0 = 1 + (\alpha_1+\alpha_2)^2 + \alpha_1^2\alpha_2^2$$
$$\delta_1 = -(\alpha_1+\alpha_2)(1 + \alpha_1\alpha_2)$$
$$\delta_2 = \alpha_1\alpha_2$$

The appropriate estimation procedure for Eq. (9-75) is then a combination of
GLS and a two-dimensional grid search over $\alpha_1$, $\alpha_2$. For each $\alpha_1$, $\alpha_2$ pair the
variance matrix in Eq. (9-76) is computed and then GLS applied to Eq. (9-75).
One chooses the set of parameters that minimizes the weighted sum of squares
$e'\Omega^{-1}e$, where $e$ is the vector of residuals computed from Eq. (9-75) by using the
GLS estimates and $\Omega$ is the matrix in Eq. (9-76).

Various models have been considered in this section, and it may be helpful to
summarize them briefly in Table 9-1.


9-3 TIME-SERIES METHODS


The models summarized in Table 9-1 incorporated various theoretical specifications.
The first model embodied the least a priori specification, but direct
estimation was liable to be somewhat imprecise, which led to the development of
Almon approximations. The Koyck hypothesis in model 3 is a very strong
assumption. Models 4 to 6 are versions of adaptive expectations and partial
adjustment depending on the treatment of the disturbance term. Model 7 is a
combination of adaptive expectations and partial adjustment, while model 8
incorporates two explanatory variables with separate Koyck lags.

In recent years time-series methods of estimating a lagged relationship such
as Eq. (9-1)

$$Y_t = \mu + D(L)X_t + u_t$$

have come to be more extensively employed. As seen in Sec. 9-1, this relation may
be formulated equivalently as Eq. (9-5),

$$Y_t = \mu + \frac{B(L)}{A(L)}X_t + u_t$$

where $D(L)$, $B(L)$, and $A(L)$ are all polynomials in the lag operator, but the
orders of $B(L)$ and $A(L)$ are expected to be small relative to the order of $D(L)$.
The relation (9-5) is known as a transfer function in the time-series literature.†
There are four main characteristics which distinguish time-series estimation
methods from the various estimation procedures described in Sec. 9-2.

1. Before estimating the transfer function, the "input" series $\{X_t\}$ and the
"output" series $\{Y_t\}$ are subjected to sufficient differencing to render both
resultant series stationary.

2. The orders of the A(L), B(L) polynomials are determined empirically from 
the data by an identification process and without imposing any a priori 
theoretical specifications, such as a set of declining exponential coefficients. 

3. The disturbance term in the transfer function is estimated as a general 
ARMA process, as described in Sec. 8-5, rather than as a low-order AR or 
MA process as in some of the models in Sec. 9-2. 

4. The transfer function approach has been most extensively developed for the
single-input case (that is, one explanatory variable with various lagged values), 
and there is no firm agreement yet on the appropriate extension to cope with 
two or more inputs, each with a set of lags. 


To get a grasp of the methodology we need to discuss each of these four 
points in greater detail. 


Stationarity 


The simplest example of a stationary process is the white noise series $\{\varepsilon_t\}$, where
the $\varepsilon$'s are independently and identically distributed as $N(0, \sigma_\varepsilon^2)$. It follows from
the definition that

$$E(\varepsilon_t) = 0 \qquad \text{for all } t$$
$$\operatorname{var}(\varepsilon_t) = E(\varepsilon_t^2) = \sigma_\varepsilon^2 \qquad \text{for all } t$$
$$\gamma_s = \operatorname{cov}(\varepsilon_t, \varepsilon_{t-s}) = E(\varepsilon_t\varepsilon_{t-s}) = 0 \qquad \text{for all } t \text{ and } s \neq 0$$

and

$$\rho_s = \begin{cases} 1 & \text{for } s = 0 \\ 0 & \text{for } s \neq 0 \end{cases}$$

Thus the mean and the variance of the series are constant, finite, and independent
of the time subscript, as are the covariances and the autocorrelations. These
conditions constitute a definition of second-order, or weak, stationarity. A series is
said to be strictly stationary if the joint probability distribution of $X_{t_1}, \ldots, X_{t_n}$ is
the same as the joint distribution of $X_{t_1+\tau}, \ldots, X_{t_n+\tau}$ for all $t_1, \ldots, t_n, \tau$. Since a
multivariate normal distribution is completely specified by the first- and second-order
moments, the $\{\varepsilon_t\}$ series is also strictly stationary.

† The basic reference is G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and
Control, revised edition, Holden-Day, San Francisco, 1976, especially Chaps. 10 and 11.
Now consider

$$X_t = \phi X_{t-1} + \varepsilon_t \tag{9-77}$$

where $\{\varepsilon_t\}$ is white noise. This process may also be expressed as

$$\phi(L)X_t = (1 - \phi L)X_t = \varepsilon_t$$

giving

$$X_t = \varepsilon_t + \phi\varepsilon_{t-1} + \phi^2\varepsilon_{t-2} + \cdots$$

Thus

$$E(X_t) = 0 \qquad \text{for all } t$$

and

$$\operatorname{var}(X_t) = E(X_t^2) = \sigma_\varepsilon^2(1 + \phi^2 + \phi^4 + \cdots)$$

This last expression only converges if $|\phi| < 1$. We have already seen in Sec. 8-5
that if $|\phi| < 1$,

$$\sigma_x^2 = \frac{\sigma_\varepsilon^2}{1 - \phi^2}$$

and the autocorrelation function is given by

$$\rho_s = \phi^s$$

Thus Eq. (9-77) is a stationary process if $|\phi| < 1$. This condition is also stated
equivalently as the root of $\phi(L)$, or the zero of the polynomial $\phi(L)$, lying outside
the unit circle. This root is obtained by setting

$$\phi(L) = 1 - \phi L = 0$$

and solving for $L$ to find $L = 1/\phi$. Clearly, the condition $|\phi| < 1$ implies
$|L| > 1$.

If $|\phi| > 1$, the root of $\phi(L)$ lies inside the unit circle and Eq. (9-77) is an
explosive series.† Now consider the in-between case where $\phi = 1$. Relation (9-77)
then defines the random walk

$$X_t = X_{t-1} + \varepsilon_t$$

Clearly, $\operatorname{var}(X_t)$ still explodes and $X_t$ is not a stationary series. However, $\Delta X_t =
(1 - L)X_t$ is a stationary series since it is equal to $\varepsilon_t$. Thus first differencing the
random walk series produces a stationary series, but no finite number of differences
of Eq. (9-77) can produce a stationary series if $|\phi| > 1$.

† See Problem 9-8.
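
The three cases can be illustrated numerically. The sketch below, in Python with NumPy, simulates Eq. (9-77) for a given $\phi$ from a supplied white noise series; the function name `simulate_ar1` and the starting value $X_0 = 0$ are illustrative assumptions.

```python
import numpy as np

def simulate_ar1(phi, eps, x0=0.0):
    """Generate X_t = phi * X_{t-1} + eps_t for t = 1..len(eps), starting from x0."""
    x = np.empty(eps.size)
    prev = x0
    for t, e in enumerate(eps):
        prev = phi * prev + e
        x[t] = prev
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps = rng.standard_normal(5000)
    stationary = simulate_ar1(0.5, eps)    # |phi| < 1
    walk = simulate_ar1(1.0, eps)          # phi = 1, random walk
    explosive = simulate_ar1(2.0, eps[:30])  # |phi| > 1
    print("sample variance, phi = 0.5:", stationary.var())
    print("theoretical variance      :", 1.0 / (1.0 - 0.5 ** 2))
    print("differenced walk equals eps:", np.allclose(np.diff(walk), eps[1:]))
    print("last value, phi = 2       :", explosive[-1])
```

With $\phi = 0.5$ the sample variance settles near $\sigma_\varepsilon^2/(1-\phi^2)$, while first differencing the random walk recovers the white noise series exactly.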

Extending the model to a second-order scheme gives

$$X_t = \phi_1X_{t-1} + \phi_2X_{t-2} + \varepsilon_t \tag{9-78}$$

or

$$\phi(L)X_t = \varepsilon_t \tag{9-79}$$

where

$$\phi(L) = 1 - \phi_1L - \phi_2L^2$$

By analogy with the first-order case we seek conditions on the roots of $\phi(L)$
which might distinguish between the stationary case, the explosive case, and the
intermediate case, where differencing might produce a stationary series. The
polynomial may be factorized as

$$\phi(L) = (1 - c_1L)(1 - c_2L)$$

and so the roots of the polynomial are $c_1^{-1}$ and $c_2^{-1}$. From Eq. (9-79)

$$X_t = \phi^{-1}(L)\varepsilon_t = \frac{1}{(1 - c_1L)(1 - c_2L)}\varepsilon_t$$

The term $1/(1 - c_1L)(1 - c_2L)$ may be expanded in partial fractions as

$$\frac{1}{(1 - c_1L)(1 - c_2L)} = \frac{d}{1 - c_1L} + \frac{1 - d}{1 - c_2L}$$

where $d = c_1/(c_1 - c_2)$, as may be verified by multiplying out. Thus

$$
\begin{aligned}
X_t &= \frac{d}{1 - c_1L}\varepsilon_t + \frac{1 - d}{1 - c_2L}\varepsilon_t \\
&= d(\varepsilon_t + c_1\varepsilon_{t-1} + c_1^2\varepsilon_{t-2} + \cdots)
 + (1 - d)(\varepsilon_t + c_2\varepsilon_{t-1} + c_2^2\varepsilon_{t-2} + \cdots)
\end{aligned}
$$

and the variance of $X_t$ will only be finite and constant if $|c_1|$ and $|c_2|$ are both
less than unity, that is, if the roots of $\phi(L)$ lie outside the unit circle. The condition
on the roots may be stated equivalently in terms of the $\phi_1$, $\phi_2$ parameters of Eq.
(9-78) as†

$$|\phi_2| < 1$$
$$\phi_2 + \phi_1 < 1$$
$$\phi_2 - \phi_1 < 1$$

† G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, revised edition,
Holden-Day, San Francisco, 1976, p. 58.

If the polynomial factorizes as

$$\phi(L) = (1 - c_1L)(1 - L)$$

Eq. (9-79) becomes

$$(1 - c_1L)(1 - L)X_t = (1 - c_1L)\Delta X_t = \varepsilon_t \tag{9-80}$$

Even if $|c_1| < 1$, the $X_t$ series is nonstationary since the other root lies on the unit
circle. However, it is clear from Eq. (9-80) that the $\Delta X_t$ series is stationary as long
as $|c_1| < 1$. If a third-degree polynomial factorizes as

$$\phi(L) = (1 - c_1L)(1 - L)^2$$

then second differencing the $X_t$ series will yield a stationary series as long as
$|c_1| < 1$.
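
The root condition is easy to check numerically. The following Python/NumPy sketch computes the roots of $\phi(L) = 1 - \phi_1L - \cdots - \phi_pL^p$ and, for the second-order case, compares the result with the three parameter inequalities. The function names and the small tolerance used to classify roots on the unit circle are illustrative assumptions.

```python
import numpy as np

def ar_roots(phis):
    """Roots of phi(L) = 1 - phi_1 L - ... - phi_p L^p."""
    # np.roots expects coefficients ordered from the highest power of L down.
    coeffs = np.r_[-np.asarray(phis, dtype=float)[::-1], 1.0]
    return np.roots(coeffs)

def is_stationary(phis, tol=1e-8):
    """True when every root lies strictly outside the unit circle."""
    return bool(np.all(np.abs(ar_roots(phis)) > 1.0 + tol))

def ar2_triangle(phi1, phi2):
    """The equivalent AR(2) conditions stated in terms of phi_1 and phi_2."""
    return bool(abs(phi2) < 1 and phi2 + phi1 < 1 and phi2 - phi1 < 1)

if __name__ == "__main__":
    for phi1, phi2 in [(0.5, 0.3), (0.5, 0.6), (1.5, -0.5)]:
        print(phi1, phi2, np.abs(ar_roots([phi1, phi2])),
              is_stationary([phi1, phi2]), ar2_triangle(phi1, phi2))
```

The last pair corresponds to $\phi(L) = (1 - 0.5L)(1 - L)$, the unit-root case of Eq. (9-80): one root has modulus 2 and the other lies on the unit circle, so the level series fails both checks.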

So far we have just considered AR processes of the form $\phi(L)X_t = \varepsilon_t$, where
$\varepsilon_t$ is white noise, and have seen that the condition for stationarity can be
expressed in terms of the roots of $\phi(L)$. The same conditions hold when the
disturbance of the right-hand side follows an MA scheme, for if we write

$$\phi(L)X_t = \theta(L)\varepsilon_t \tag{9-81}$$

where $\theta(L)$ is a finite MA operator,

$$\theta(L) = 1 - \theta_1L - \theta_2L^2 - \cdots - \theta_qL^q$$

then $\theta(L)\varepsilon_t$ is a stationary series. It has zero mean, a constant variance, and an
autocorrelation function which is nonzero for the first $q$ lags and zero thereafter.
Thus the stationarity of the $X_t$ series still depends on the roots of $\phi(L)$. The
general form of Eq. (9-81) is

$$(1 - \phi_1L - \phi_2L^2 - \cdots - \phi_pL^p)(1 - L)^dX_t = (1 - \theta_1L - \theta_2L^2 - \cdots - \theta_qL^q)\varepsilon_t \tag{9-82}$$

This is an autoregressive, integrated, moving average, ARIMA($p$, $d$, $q$) scheme,
where $p$ is the order of the AR polynomial, $d$ is the degree of differencing required
to yield a stationary series (or equivalently, the number of unit roots in $\phi(L)$),
and $q$ is the order of the MA polynomial. The term integrated refers to the reverse
of the differencing operation since the differenced series have to be summed (or
integrated) to retrieve the original series. It did not arise in Sec. 8-5 where ARMA
processes were introduced to model a disturbance series which was already
stationary.
The general ARIMA model of Eq. (9-82) has been found to be a very flexible
tool for the univariate modeling and forecasting of a wide variety of homogeneous
nonstationary series that are not explosive, but which may display drift or
apparent short-run trends as well as various irregular oscillations. The univariate
modeling procedure consists of first determining the amount of differencing
required to produce approximate stationarity. Typically it appears that, if
differencing is required, first or at most second differences suffice. Defining

$$x_t = (1 - L)^dX_t$$

the second stage is making a judgment about the orders $p$ and $q$ in

$$(1 - \phi_1L - \phi_2L^2 - \cdots - \phi_pL^p)x_t = (1 - \theta_1L - \theta_2L^2 - \cdots - \theta_qL^q)\varepsilon_t$$

This is done by comparing the pattern of the estimated autocorrelation coefficients
of the $\{x_t\}$ series with the theoretical patterns corresponding to various small
values of $p$ and $q$. An initial estimate of the $\phi$, $\theta$ parameters is then derived, which
serves as the first round in a nonlinear iterative estimation process. Finally
various diagnostic checks are applied to the fitted model.
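
The first two stages, differencing and inspection of the estimated autocorrelations, can be sketched as follows in Python with NumPy. The function names `difference` and `sample_acf` are hypothetical, and the autocorrelation estimator shown (lagged cross products of deviations from the mean, divided by the lag-zero sum) is one common convention rather than a formula taken from the text.

```python
import numpy as np

def difference(x, d=1):
    """Apply the operator (1 - L) to the series d times."""
    out = np.asarray(x, dtype=float)
    for _ in range(d):
        out = np.diff(out)
    return out

def sample_acf(x, max_lag):
    """Estimated autocorrelations r_0, ..., r_max_lag of a series."""
    dev = np.asarray(x, dtype=float) - np.mean(x)
    c0 = dev @ dev
    n = dev.size
    return np.array([(dev[k:] @ dev[:n - k]) / c0 for k in range(max_lag + 1)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    walk = np.cumsum(rng.standard_normal(500))   # a series needing d = 1
    print("levels     :", np.round(sample_acf(walk, 5), 2))
    print("differences:", np.round(sample_acf(difference(walk), 5), 2))
```

Autocorrelations of the levels that die away very slowly suggest that further differencing is needed; once the pattern cuts off or decays quickly, it can be compared with the theoretical patterns for small $p$ and $q$.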

Econometricians are naturally more interested in the estimation of transfer
functions than in univariate time-series modeling. However, the latter turns out to
be an essential component of the former. Returning to the transfer function (9-5),
we may write it explicitly as

$$(1 - \alpha_1L - \alpha_2L^2 - \cdots - \alpha_rL^r)y_t = (\beta_0 + \beta_1L + \cdots + \beta_sL^s)x_{t-b} + u_t$$

or

$$A(L)y_t = B(L)x_{t-b} + u_t$$

which is a transfer function of order $(r, s, b)$, where $b \geq 0$ represents any delay in
the transmission of an effect from $X$ to $Y$. The $X$ and $Y$ series are appropriately
differenced to achieve (near) stationarity and are also expressed as deviations
from the sample means, if necessary.† The problem now is the determination of
the values of $r$, $s$, and $b$ and the estimation of the consequent $\alpha$ and $\beta$ parameters.
The Box-Jenkins starting point is the calculation of the covariances (current and
lagged) between $x$ and $y$ and the autocovariances of the $x$ series. The solution of a
set of simultaneous equations yields estimates of the $\delta$ coefficients.‡ From the
resultant coefficients rough guesses are made of the values of $r$, $s$, and $b$ on the
basis of a comparison between the pattern of the $\delta$ coefficients and the theoretical
patterns for various values of $r$, $s$, and $b$. From the $\delta$'s initial estimates of the $\alpha$'s
and $\beta$'s can be derived and an iterative estimation process carried out, with
interaction between the estimation of the transfer function weights and the fitting
of an ARIMA scheme to the disturbance term. If the original disturbance was a
white noise series, any differencing will have produced an MA process in the
transformed disturbances, and if the original disturbance was complicated, the
transformed disturbance will normally be more complicated.

Box and Jenkins also suggest that the efficiency of the above process could be
improved if an ARIMA model was first fitted to the $x_t$ series. Denote such a
model by

$$\phi_x(L)x_t = \theta_x(L)\eta_t$$

where $\eta_t$ is approximately a white noise series. Now multiply through the model

$$y_t = D(L)x_t + u_t$$

by $\theta_x^{-1}(L)\phi_x(L)$. The result is

$$y_t^* = D(L)\eta_t + v_t \tag{9-83}$$

where

$$y_t^* = \theta_x^{-1}(L)\phi_x(L)y_t \qquad \text{and} \qquad v_t = \theta_x^{-1}(L)\phi_x(L)u_t$$

Equation (9-83) still preserves the $\delta$ coefficients of the original equation, but the $\eta_t$
variable is approximately white noise and so its lagged covariances will be
approximately zero. This leads to a considerable simplification in obtaining the
original $\delta$ estimates, since a series of single equations is solved rather than a set of
simultaneous equations. The process leading to Eq. (9-83) is termed prewhitening
the input series. The expression $\theta_x^{-1}(L)\phi_x(L)$ is termed a filter, and the same
filter is applied to both the input and the output series.

† It is assumed that the same degree of differencing has been applied to each series. However, Box
and Jenkins state, "the procedures outlined can equally well be used when different degrees of
differencing are employed for input and output" (op. cit., ftn., p. 378). Consider

$$Y_t = \alpha + \beta X_t + u_t$$

First differencing both $Y$ and $X$ gives

$$\Delta Y_t = \beta\Delta X_t + \Delta u_t$$

so that the original $\beta$ coefficient is retained while the intercept disappears. If different degrees of
differencing are applied to each variable, one would no longer be estimating the original $\beta$ coefficient.
‡ These are the coefficients of the various lagged values of $X$, defined earlier in Eq. (9-1).
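
A minimal Python/NumPy sketch of prewhitening is given below, under a deliberately simple assumption: the model fitted to the input is a pure AR(1), so that the filter is just $(1 - \hat\phi L)$. With a white noise filtered input, each lag weight is estimated from a single equation as the cross covariance between the filtered output and the lagged filtered input, divided by the variance of the filtered input. The function names `fit_ar1` and `prewhitened_weights` are hypothetical.

```python
import numpy as np

def fit_ar1(x):
    """OLS estimate of phi in x_t = phi * x_{t-1} + eta_t (x in deviation form)."""
    return float((x[1:] @ x[:-1]) / (x[:-1] @ x[:-1]))

def prewhitened_weights(y, x, max_lag):
    """Estimate the first lag weights of D(L) after prewhitening the input."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    phi = fit_ar1(x)
    eta = x[1:] - phi * x[:-1]        # filtered (prewhitened) input
    y_star = y[1:] - phi * y[:-1]     # the same filter applied to the output
    n = eta.size
    var_eta = (eta @ eta) / n
    weights = [((y_star[k:] @ eta[:n - k]) / n) / var_eta
               for k in range(max_lag + 1)]
    return phi, np.array(weights)
```

In practice the filter would come from whatever ARIMA model has been identified for the input, and the rough weights obtained this way serve only as a guide to the orders $r$, $s$, and $b$ and as starting values for the iterative estimation.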

There is an analogy between these procedures and the transformation con- 
ventionally applied in econometrics. The model (9-1) with various lagged values 
of just a single input may be written in the usual matrix form as 


$$y = X\delta + u \tag{9-84}$$


As seen in Chap. 8, a nonspherical variance matrix for the disturbance term leads 
to GLS estimation procedures. The GLS procedure is equivalent to premultiply- 
ing Eq. (9-84) by a transformation matrix T and applying OLS to the transformed 
data Ty and TX. The matrix T is chosen according to the assumed properties of 
the disturbance term so as to make Tu a white noise series. The time-series 
approach concentrates first of all on the properties of $y$ and $X$ in Eq. (9-84) and
not on the nature of $u$. A common differencing procedure is applied to $Y_t$ and $X_t$,
followed by a common filter derived from the ARIMA model fitted to $X_t$. The
$D(L)$ polynomial containing the "long" series of $\delta$ coefficients is finally represented
by the ratio of two low-order polynomials which are estimated along with
an ARIMA model for the disturbance term. 

It is impossible to give here a detailed operational description of the time-series
procedures.† However, it is clear that a considerable amount of "judgment" is
required at various stages in choosing between different ARIMA and different
transfer function models. Time-series analysts also stress that long runs of
observations, preferably in excess of 100, are desirable, which requires the
assumption that the underlying economic structure has been stable for that length
of time. The estimation of even the single-input case is fairly complicated. The
model is perhaps most appropriate to a "black-box" situation where interest
centers on a single input variable which can be controlled in any desired manner
but the researcher has no clearly articulated theory of the relation between the
input and the output.

† Reference should be made to G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting
and Control, revised edition, Holden-Day, San Francisco, 1976, or to C. W. J. Granger and P.
Newbold, Forecasting Economic Time Series, Academic Press, New York, 1977. A lucid introduction
to a wide range of time series topics is provided by C. Chatfield, The Analysis of Time Series: Theory
and Practice, Chapman and Hall, London, 1975.

The approach described above cannot be simply extended to multiple-input 
models, since the covariances between the output and any input are contaminated 
by the effects of the other inputs, unless the inputs are orthogonal. Spectral 
methods are a possibility, but are not yet well developed for this case. A 
somewhat different approach for dealing with two or more inputs has recently 
been suggested by Liu and Hanssens.† Their approach is a modification of the
corner method for ARMA identification proposed by Beguin, Gourieroux, and
Monfort.‡ Much work is proceeding in this field, and it is too soon to assess the
likely practical significance of the methods currently under development. 

A final time-series approach that may be noted for the two-variable case is
the prewhitening of both series. Letting $y_t$ and $x_t$ denote appropriately differenced
series as usual, a separate ARMA model is fitted to each series, denoted by

$$\phi_y(L)y_t = \theta_y(L)\hat a_{yt} \qquad \text{and} \qquad \phi_x(L)x_t = \theta_x(L)\hat a_{xt} \tag{9-85}$$

where $\hat a_{yt}$ and $\hat a_{xt}$ denote estimated residuals which are approximately white noise
series. Thus $y_t$ is prewhitened by the filter $\theta_y^{-1}(L)\phi_y(L)$ to yield $\hat a_{yt}$, and $x_t$ is
prewhitened by its filter to yield $\hat a_{xt}$. It is argued that this approach is useful in
cases where there is doubt about the direction of causation. Does $x$ cause $y$ so that
one expects nonzero correlations between $y$ and earlier values of $x$, or is it the
other way around, or is there joint causation and feedback? The suggested
procedure is to compute the cross correlations at various lags, positive and
negative, between $\hat a_{yt}$ and $\hat a_{xt}$. Inspection of these correlations should lead to a
decision about causation. For example, if causation is thought to run from $x$ to $y$,
a transfer function model is estimated for $\hat a_{yt}$ on $\hat a_{xt}$, say,

$$\hat a_{yt} = \Gamma(L)\hat a_{xt} + \text{noise} \tag{9-86}$$

where the parameters of this transfer function are indicated by

$$\Gamma(L) = \gamma_0 + \gamma_1L + \gamma_2L^2 + \cdots$$

to emphasize that they are not the original structural coefficients $D(L)$ connecting
$y$ and $x$. Finally substituting for $\hat a_{yt}$ and $\hat a_{xt}$ from Eq. (9-85) gives

$$\theta_y^{-1}(L)\phi_y(L)y_t = \Gamma(L)\theta_x^{-1}(L)\phi_x(L)x_t + \text{noise} \tag{9-87}$$

which is a relation connecting $y$ and $x$. The resultant estimates of the structural
coefficients are obtained by equating coefficients of the powers of $L$ in

$$D(L) = \theta_y(L)\phi_y^{-1}(L)\Gamma(L)\theta_x^{-1}(L)\phi_x(L) \tag{9-88}$$

† L. M. Liu and D. M. Hanssens, "Identification of Multiple-Input Transfer Function Models,"
Communications in Statistics, 1982.
‡ J. M. Beguin, C. Gourieroux, and A. Monfort, "Identification of a Mixed Autoregressive-Moving
Average Process: The Corner Method," in O. D. Anderson, Ed., Time Series Analysis, North-Holland,
Amsterdam, 1980.


This is a rather different procedure than just prewhitening the input, applying
the same filter to the output, and then estimating $D(L)$ directly as in Eq. (9-83).
In principle the $D(L)$ polynomial is recoverable and capable of being estimated
by either method. For example, suppose $y_t$ and $x_t$ simply follow different AR(1)
processes,

$$(1 - \alpha_1L)y_t = u_{yt} \qquad \text{and} \qquad (1 - \alpha_2L)x_t = u_{xt}$$

Substituting for $y_t$ and $x_t$ in $y_t = D(L)x_t$ gives

$$u_{yt} = (1 - \alpha_1L)D(L)(1 - \alpha_2L)^{-1}u_{xt}$$

which is the implied transfer function between the separate white noise processes.
Substituting now for $u_{yt}$ and $u_{xt}$ gives

$$(1 - \alpha_1L)y_t = (1 - \alpha_1L)D(L)(1 - \alpha_2L)^{-1}(1 - \alpha_2L)x_t$$

which gets us back to

$$y_t = D(L)x_t$$

However, the above is in terms of the true coefficients and has also ignored the
noise terms. In practice the bivariate prewhitening approach requires the estimation
of more parameters and greater manipulations of those estimated parameters
than does the univariate prewhitening method. It would be interesting to see
comparative case studies of the results yielded by the two approaches, but there
do not yet appear to be any. An extensive application of the bivariate prewhitening
approach to various time series of money and interest rates yielded "a
surprising, probably disconcerting, lack of relationship among several variables."†
Pierce's main conclusion was, "Extensions of time series modeling procedures of
Box and Jenkins reveal that numerous economic variables which are generally
regarded as being strongly interrelated may with equal validity, based on recent
empirical evidence, be regarded as independent or only weakly related." A further
study by Haugh and Box illustrated the same approach to the study of the
connection between the GNP $X$ and the unemployment rate $Y$ in the United
Kingdom.‡ Each series was first differenced to yield $x_t = X_t - X_{t-1}$ and $y_t =
Y_t - Y_{t-1}$. The separate ARMA models were estimated from 56 deseasonalized
quarterly observations as

$$(1 - 0.63L)y_t = \hat a_{yt}$$

and

$$x_t = 0.66 + \hat a_{xt}$$

† D. A. Pierce, "Relationships—and the Lack Thereof—Between Economic Time Series, with
Special Reference to Money and Interest Rates," Journal of the American Statistical Association, vol.
72, 1977, pp. 11-22.
‡ L. D. Haugh and G. E. P. Box, "Identification of Dynamic Regression (Distributed Lag) Models
Connecting Two Time Series," Journal of the American Statistical Association, vol. 72, 1977, pp.
121-130.

Table 9-3 Summary statistics for $r$

| | Number of coefficients exceeding two standard errors | Number of coefficients exceeding one standard error | Mean absolute coefficient |
|---|---|---|---|
| Negative lags | 2 | 7 | 0.13 |
| Positive lags | 2 | 8 | 0.14 |

The first concern was the direction of causation. Table 9-2 shows the various
lagged cross correlations. A positive lag is here defined as the $y$ series lagging
behind the $x$ series. The asymptotic standard error for $r$ is 0.13. Table 9-3 presents
three summary statistics computed by the author from the data in Table 9-2. The
data in these two tables hardly seem to give any clear indication of the direction
of causation. However the authors of the paper state, "It is concluded that any
feedback effect is of secondary importance, as evidenced by the small cross
correlations at negative lags. . . . This direction of causation from x to y agrees
with that considered by Bray."† Another time series analyst might well interpret
these cross correlations differently and fit a different transfer function to the
residuals.

There is as yet no clear consensus on the relative roles of time-series 
techniques and the more orthodox econometric methods. Some mistakenly view 
them as competitive rather than complementary. Each is still an “art,” as distinct 
from a “science,” in that time-series practitioners have to make various subjective 
judgments in the course of their analyses just as econometricians conventionally 
“choose” between different regressions and specifications. Investigators with 
strong prior beliefs can usually see “patterns” in the data that may be invisible to 


more sceptical colleagues.‡


PROBLEMS 


9-1 Deduce the $\delta$ coefficients implied for $D(L) = B(L)/A(L)$ where

$$A(L) = 1 - \alpha_1L - \alpha_2L^2$$
$$B(L) = \beta_0 + \beta_1L$$


Derive an expression for the mean lag in this process. 


† L. D. Haugh and G. E. P. Box, op. cit., p. 127.

‡ Years ago in Ireland "reading the tea cups" was a favorite social pastime before the advent of
the ubiquitous tea bag. On draining the tea cup the haphazard pattern of the remaining leaves could
be interpreted by the skilled "reader" as full of meaning and significance. Nowadays a different class
of professionals apply similar "skills" to the interpretation of computer printouts.

9-2 A model is specified as
y= Or oie, wel = | 
u, = & + ae,_) 
with 
e~ N(0, 021) 


The 8 parameter is estimated by 8 = LY,Y,_ )/LY2 |. Show: 


Feb ge(h= O>) a 
(a) plimd =6 + 14286 where Pas 
(b) plim( 5a?) = oF) + «(a — a*)] 


where 
o(l- 8) 
1 + 286 
9-3 Show that a second-degree approximation for the Almon lag implies the restrictions 
8, — 38, + 38, — 6) =0 
5, — 38; + 36, — 8, = 0 


3=0 


a= and i} = Y, - 8Y,_, 


- 
and hence verify the R, matrix shown in Eq. (9-37). 
9-4 Verify that the information matrix for the model of Eq. (9-44) is given by Eq. (9-46). 
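9-5 Prove Eq. (9-56). [Hint: Use the lag operator to express Eq. (9-47) as

$$Y_t = \text{constant} + \beta_2(X_t + \beta_3X_{t-1} + \beta_3^2X_{t-2} + \cdots) + (u_t + \beta_3u_{t-1} + \beta_3^2u_{t-2} + \cdots)]$$

9-6 Derive the information matrix (9-67). [Hint: Write $\varepsilon_t$ in the alternative forms

$$\varepsilon_t = Y_t^* - \beta_1(1 - \rho) - \beta_2X_t^* - \beta_3Y_{t-1}^*$$

and

$$\varepsilon_t = u_t - \rho u_{t-1}$$

where

$$Y_t^* = Y_t - \rho Y_{t-1} \qquad X_t^* = X_t - \rho X_{t-1} \qquad u_t = Y_t - \beta_1 - \beta_2X_t - \beta_3Y_{t-1}$$

and then find all the second-order partial derivatives of

$$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}\sum\varepsilon_t^2]$$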


9-7 Derive the information matrix (9-72) and show also that in this model the estimator of $\sigma_u^2$ is
asymptotically independent of the remaining estimators.

9-8 Consider $X_t = 2X_{t-1} + \varepsilon_t$ where $\{\varepsilon_t\}$ is a white noise series. Draw some sets of $\varepsilon$'s from a table of
random normal deviates and compute the corresponding sample realizations of the process for
$t = 1, 2, \ldots, 10$, starting each realization off by setting $X_0 = 0$. Satisfy yourself that $X_t$ can become
"very large" in both positive and negative directions.

9-9 If $u_t = (1 - \theta_1L - \theta_2L^2 - \cdots - \theta_qL^q)\varepsilon_t$ and $\{\varepsilon_t\}$ is white noise, derive the autocorrelation
function of the $\{u_t\}$ series.

9-10 In the rational lag equation 


y 3L k. 
Jie oy 
“-=09b4+ 0a 


determine: 
(a) The total multiplier 
(b) The mean lag 


(c) The coefficients of $x_{t-i}$ for $i = 0, 1, 2, 3$. (UL, 1980)

9-11 The model generating the $\{y_t\}$ series is presumed to be

$$y_t = \alpha y_{t-1} + u_t \qquad |\alpha| < 1$$

with

$$u_t = \rho u_{t-1} + \varepsilon_t \qquad |\rho| < 1$$

and the $\varepsilon$'s independently and identically distributed with zero mean and constant variance $\sigma^2$. Show
that

$$\hat\rho = \frac{\sum w_tw_{t-1}}{\sum w_{t-1}^2}$$

where $w_t = y_t - \hat\alpha y_{t-1}$ and $\hat\alpha$ is the OLS estimator of $\alpha$, is not a consistent estimator of $\rho$.
(UL, 1973)

9-12 A simple stochastic version of the permanent income model is

$$z_t = x_t + u_t$$

where $z_t$ is observed income, $x_t$ is permanent income which is unobserved, and $u_t$ is a serially random
transitory element. Let $x_t$ evolve according to $x_t = x_{t-1} + v_t$. Assume $u$ and $v$ are independent.

(a) What constraints does this model place on the autocorrelation function of $(z_t - z_{t-1})$?

(b) Discuss how you would estimate $\sigma_u^2$ and $\sigma_v^2$ from data on $z$.

(c) What would positive sample autocorrelation at lag 1 in $z_t - z_{t-1}$ suggest about the
plausibility of the model?

Suppose now that $u$ and $v$ are not assumed to be independent but have covariance $\sigma_{uv}$:

(d) Are the parameters $\sigma_u^2$, $\sigma_v^2$, and $\sigma_{uv}$ identified?

(e) What would positive autocorrelation in $z_t - z_{t-1}$ imply about the sign of $\sigma_{uv}$? About the
magnitude of $\sigma_u^2$ relative to $\sigma_v^2$?

(University of Washington, 1980)


CHAPTER 


TEN 
A SMORGASBORD OF FURTHER TOPICS 


Chaps. 5 to 9 have presented the “standard fare” of the single-equation linear 
model. This chapter outlines a number of additional topics, some of which are 
“golden oldies” that have been around for some time, while others have come into 
prominence more recently. Some enthusiasts may wish to study all the topics; 
other readers may be interested in some topics but not in others. To a large degree 
the sections stand alone and can be read independently. 


10-1 RECURSIVE RESIDUALS 


As shown in Chap. 5, the vector of OLS residuals is given by 


$$e = Mu$$
where
$$M = I - X(X'X)^{-1}X'$$
is a symmetric idempotent matrix of rank $n - k$. If the $u$'s are independently
and identically distributed, it then follows that
$$E(ee') = \sigma^2 M$$


Thus the calculated residuals will, in general, display heteroscedasticity and 
nonzero covariances, even when homoscedasticity and zero covariances hold for 
the true disturbances. This leads to the difficulties in testing for heteroscedasticity 
and autocorrelation already discussed in Secs. 8-4 and 8-5.

Recursive residuals are a set of residuals which, if the disturbances are
independently and identically distributed, will themselves be independently and
identically distributed, thus greatly facilitating tests of the null hypothesis.† We
postulate the usual linear model
$$y = X\beta + u \qquad \text{with } u \sim N(0, \sigma^2 I)$$   (10-1)
and $X$ a nonstochastic matrix of order $n \times k$. Let $x_j$ denote the $k \times 1$ vector of
observations on the $k$ explanatory variables at sample point $j$.‡ Thus
$$X = \begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix}$$


Let $X_{r-1}$ denote the $(r - 1) \times k$ matrix consisting of the first $r - 1$ rows of $X$.
Provided $r - 1 \ge k$, this matrix may be used to estimate $\beta$. Denote the resultant
estimator by $b_{r-1}$, that is,
$$b_{r-1} = (X_{r-1}'X_{r-1})^{-1}X_{r-1}'y_{r-1}$$
where $y_{r-1}$ denotes the subvector consisting of the first $r - 1$ elements of $y$. Using
$b_{r-1}$ one may “forecast” $y_r$ at sample point $r$, corresponding to the vector $x_r$ of
explanatory variables at that point. The forecast error is
$$y_r - x_r'b_{r-1}$$
and, as shown in Sec. 5-4, the variance of this forecast error is
$$\sigma^2\left(1 + x_r'(X_{r-1}'X_{r-1})^{-1}x_r\right)$$
Define the recursive residual $w_r$ as
$$w_r = \frac{y_r - x_r'b_{r-1}}{\sqrt{1 + x_r'(X_{r-1}'X_{r-1})^{-1}x_r}}$$   (10-2)
Clearly, under assumption (10-1)
$$w_r \sim N(0, \sigma^2)$$
since it is a linear function of normal variables and the OLS forecast is unbiased.
† Recursive residuals are a member of the general class of LUS residuals (linear unbiased with a
scalar variance matrix). Another important set is the BLUS residuals due to Theil. See H. Theil,
Principles of Econometrics, Wiley, New York, 1971, Chap. 5.

‡ As is customary, the first element in each $x$ vector will be unity to accommodate the intercept
term.

A sequence of recursive residuals may be generated as follows.

1. Choose a base of $k$ observations. For the moment let this be the first $k$
observations in the sample, whether it be composed of time-series or cross-section
data. Compute the vector $b_k$ and the recursive residual
$$w_{k+1} = \frac{y_{k+1} - x_{k+1}'b_k}{\sqrt{1 + x_{k+1}'(X_k'X_k)^{-1}x_{k+1}}}$$

2. Update, or extend, the base to include the first $k + 1$ observations; compute
$b_{k+1}$ and hence $w_{k+2}$.

3. Repeat step 2, adding one new observation point at a time.
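The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NumPy is assumed to be available, the function name `recursive_residuals` and the simulated data are hypothetical, and the base is taken to be the first $k$ observations. Each pass refits OLS on the current base, which is the slow but transparent route.

```python
import numpy as np

def recursive_residuals(X, y):
    """Return w_{k+1}, ..., w_n of Eq. (10-2), refitting OLS on each base."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    w = np.empty(n - k)
    for r in range(k, n):                  # base = first r rows; forecast row r (0-based)
        Xb, yb = X[:r], y[:r]
        XtX_inv = np.linalg.inv(Xb.T @ Xb)
        b = XtX_inv @ Xb.T @ yb            # OLS estimate from the current base
        x = X[r]
        # forecast error scaled by the square root of 1 + x'(X'X)^{-1}x
        w[r - k] = (y[r] - x @ b) / np.sqrt(1.0 + x @ XtX_inv @ x)
    return w

# Illustrative (simulated) data: an intercept plus two regressors.
rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
w = recursive_residuals(X, y)
```

A convenient check on any such routine is that the squared recursive residuals sum to the residual sum of squares of the regression fitted to all $n$ observations.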


There is thus a sequence of $n - k$ recursive residuals as defined in Eq. (10-2)
for $r = k + 1, \ldots, n$. The practical importance of recursive residuals is due to the
fact that, under assumption (10-1), the vector of residuals defined by Eq. (10-2) is
multivariate normal with zero mean vector and scalar variance matrix, that is,
$$w \sim N(0, \sigma^2 I_{n-k})$$   (10-3)
Since we have already seen that each $w_r$ is normal with zero mean and variance
$\sigma^2$, the proof of Eq. (10-3) just requires the establishment of zero covariances. The
numerator in Eq. (10-2) may be written
$$y_r - x_r'b_{r-1} = u_r - x_r'(X_{r-1}'X_{r-1})^{-1}X_{r-1}'u_{r-1}$$
Thus
$$E\{(y_r - x_r'b_{r-1})(y_s - x_s'b_{s-1})\} = E\{[u_r - x_r'(X_{r-1}'X_{r-1})^{-1}X_{r-1}'u_{r-1}][u_s - x_s'(X_{s-1}'X_{s-1})^{-1}X_{s-1}'u_{s-1}]\}$$   (10-4)
We may assume that $r < s$ without any loss of generality. Thus
$$E(u_r u_s) = 0$$
$$E(u_{r-1}u_s) = \begin{bmatrix} E(u_1 u_s) \\ E(u_2 u_s) \\ \vdots \\ E(u_{r-1}u_s) \end{bmatrix} = 0$$
$$E(u_r u_{s-1}') = [0 \;\cdots\; 0 \;\; \sigma^2 \;\; 0 \;\cdots\; 0] \qquad (\sigma^2 \text{ in the } r\text{th position})$$
$$E(u_{r-1}u_{s-1}') = E\left\{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{r-1} \end{bmatrix}[u_1 \;\cdots\; u_{r-1} \;\; u_r \;\cdots\; u_{s-1}]\right\} = \sigma^2[I_{r-1} \;\; 0_{(r-1)\times(s-r)}]$$
Multiplying out the right-hand side of Eq. (10-4), remembering that the expectation
of a scalar can also be written as the expectation of the transpose of the
scalar, and using the above results, easily establishes
$$E(w_r w_s) = 0 \qquad \text{for all } r, s;\ r \ne s$$
and so Eq. (10-3) is proved.

The computation of the recursive residuals might be achieved by using the
conventional OLS formula repeatedly to compute each $b$ vector in the sequence
$b_k, b_{k+1}, \ldots, b_n$. However, the calculations are simplified by using the following
recursion formulas:†
$$(X_r'X_r)^{-1} = (X_{r-1}'X_{r-1})^{-1} - \frac{(X_{r-1}'X_{r-1})^{-1}x_r x_r'(X_{r-1}'X_{r-1})^{-1}}{1 + x_r'(X_{r-1}'X_{r-1})^{-1}x_r}$$   (10-5)
and
$$b_r = b_{r-1} + (X_r'X_r)^{-1}x_r(y_r - x_r'b_{r-1})$$   (10-6)
Since
$$X_r = \begin{bmatrix} X_{r-1} \\ x_r' \end{bmatrix}$$
it follows that
$$X_r'X_r = X_{r-1}'X_{r-1} + x_r x_r'$$
Eq. (10-5) may then be checked by multiplying the left-hand side by $X_r'X_r$, the
right-hand side by $X_{r-1}'X_{r-1} + x_r x_r'$, and seeing that both reduce to the identity
matrix.‡ Relation (10-6) may be simply derived since
$$(X_r'X_r)b_r = X_r'y_r$$
$$= X_{r-1}'y_{r-1} + x_r y_r$$
$$= (X_{r-1}'X_{r-1})b_{r-1} + x_r y_r$$
$$= (X_r'X_r)b_{r-1} + x_r(y_r - x_r'b_{r-1})$$
Finally, relations (10-5) and (10-6) may be used to derive the following:§
$$RSS_r = RSS_{r-1} + w_r^2 \qquad r = k + 1, \ldots, n$$   (10-7)
where
$$RSS_r = (y_r - X_r b_r)'(y_r - X_r b_r)$$
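The recursions (10-5) to (10-7) translate almost line for line into code. The following is a minimal sketch, assuming NumPy and simulated data; the name `recursive_update` is hypothetical. Only one matrix inversion is performed (for the initial base of $k$ observations); thereafter the inverse, the coefficient vector, and the residual sum of squares are all updated one observation at a time.

```python
import numpy as np

def recursive_update(X, y):
    """Update (X'X)^{-1}, b and RSS one observation at a time, Eqs. (10-5)-(10-7)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    S = np.linalg.inv(X[:k].T @ X[:k])     # (X_k'X_k)^{-1} from the base
    b = S @ X[:k].T @ y[:k]                # b_k
    w, rss = [], [0.0]                     # RSS_k = 0 for an exactly fitted base
    for r in range(k, n):
        x = X[r]
        f = 1.0 + x @ S @ x
        err = y[r] - x @ b                 # forecast error y_r - x_r'b_{r-1}
        w.append(err / np.sqrt(f))         # recursive residual, Eq. (10-2)
        Sx = S @ x
        S = S - np.outer(Sx, Sx) / f       # Eq. (10-5)
        b = b + S @ x * err                # Eq. (10-6), using the updated inverse
        rss.append(rss[-1] + w[-1] ** 2)   # Eq. (10-7)
    return b, np.array(w), np.array(rss)

# Illustrative (simulated) data.
rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
b_n, w, rss = recursive_update(X, y)
```

The final coefficient vector and the final entry of `rss` should agree with a direct OLS fit to all $n$ observations, which gives a simple check on the updating formulas.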
† See R. L. Brown, J. Durbin, and J. M. Evans, “Techniques for Testing the Constancy of
Regression Relationships over Time,” Journal of the Royal Statistical Society, ser. B, vol. 37, 1975, pp.
149-192, for a statement of these formulas and some notes on their history. A useful survey of
recursion formulas for various models is to be found in W. C. Riddell, “Recursive Estimation
Algorithms for Economic Research,” Annals of Economic and Social Measurement, vol. 4, 1975, pp.
397-406.

‡ See Problems 10-1 and 10-2.

§ See Problem 10-3.

These theoretical results on recursive residuals have a number of important
practical applications. First of all they provide an alternative derivation of the test
for structural change in the case where the second sample contains fewer than $k$
observations.† Based only on a heuristic proof, it was asserted in Eq. (6-27) that
under the hypothesis of no structural change
$$F = \frac{(e_*'e_* - e_1'e_1)/n_2}{e_1'e_1/(n_1 - k)} \sim F(n_2, n_1 - k)$$
where $e_*'e_*$ denotes the residual sum of squares from a regression fitted to all
$n_1 + n_2$ observations and $e_1'e_1$ is the residual sum of squares from a regression
fitted to the first $n_1$ observations. From Eq. (10-7) it follows that for a regression
with $n$ observations,
$$RSS_n = \sum_{r=k+1}^{n} w_r^2$$
since $RSS_k = 0$, as a regression with $k$ parameters fitted to $k$ observation points
will have zero residuals. Thus
$$e_1'e_1 = \sum_{r=k+1}^{n_1} w_r^2 \qquad e_*'e_* = \sum_{r=k+1}^{n_1+n_2} w_r^2$$
and so the $F$ statistic defined above becomes
$$F = \frac{\sum_{r=n_1+1}^{n_1+n_2} w_r^2 / n_2}{\sum_{r=k+1}^{n_1} w_r^2 / (n_1 - k)}$$
Since under the null hypothesis the $w_r$ are independently and identically distributed
normal variables, the $F$ statistic is seen to be the ratio of two independent $\chi^2$
variables, each divided by the appropriate number of degrees of freedom, and so
it has the $F(n_2, n_1 - k)$ distribution.
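Given a sequence of recursive residuals, the $F$ statistic just derived needs nothing more than two sums of squares. A small sketch follows, in plain Python; the function name `predictive_f` and the numbers are illustrative only, and the list `w` is assumed to hold $w_{k+1}, \ldots, w_{n_1+n_2}$ in order.

```python
def predictive_f(w, n1, k):
    """F statistic for structural change from recursive residuals.

    w holds w_{k+1}, ..., w_{n1+n2}; the first n1 - k entries belong to the
    first sample and the remaining n2 entries to the second.
    """
    first, second = w[: n1 - k], w[n1 - k:]
    n2 = len(second)
    numerator = sum(v * v for v in second) / n2
    denominator = sum(v * v for v in first) / (n1 - k)
    return numerator / denominator, (n2, n1 - k)

# Made-up residuals: k = 2, n1 = 6, so four residuals fall in the first sample.
F, df = predictive_f([1.0, -2.0, 1.0, 2.0, -1.0, 3.0], n1=6, k=2)
```

Here the first-sample mean square is 10/4 and the second-sample mean square is 10/2, so the statistic is 2 with (2, 4) degrees of freedom.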

A second useful application of recursive residuals lies in testing for
heteroscedasticity.‡ If the alternative hypothesis to homoscedasticity is that $\sigma^2$ varies
with $X_j$, the procedure would be as follows.

1. Order the data according to the values of $X_j$ and choose a base of at least $k$
points from among the central observations.

2. From that base compute a vector $w_1$ of recursive residuals corresponding to
the first $m$ observations, and another vector $w_2$ of recursive residuals
corresponding to the last $m$ observations.§ Since the smallest feasible base is of size
$k$, the maximum value of $m$ is $(n - k)/2$.

3. Under the null hypothesis it follows directly from the properties of recursive
residuals that the test statistic
$$F = \frac{w_2'w_2}{w_1'w_1} \sim F(m, m)$$   (10-8)

† See A. C. Harvey, “An Alternative Proof and Generalization of a Test for Structural Change,”
The American Statistician, vol. 30, 1976, pp. 122-123.

‡ A. C. Harvey and G. D. A. Phillips, “A Comparison of the Power of Some Tests for
Heteroscedasticity in the General Linear Model,” Journal of Econometrics, vol. 2, 1974, pp. 307-316.

§ Notice that there is no problem in computing recursive residuals backward or forward in a
sample from any suitably chosen base, or indeed in adding “new” observations in any order.

Some sampling experiments by Harvey and Phillips indicate that the power of the
test in Eq. (10-8) compares favorably with that of the Goldfeld-Quandt test
described in Sec. 8-4. They recommend setting $m$ at approximately $n/3$. An
advantage of the recursive residuals test over that of Goldfeld and Quandt is the
greater flexibility of the former. If, for example, one now wished to test whether
$\sigma^2$ varies with some other variable $X_i$, one could simply regroup the existing
recursive residuals according to low and high values of $X_i$ and compute Eq. (10-8)
afresh, whereas the Goldfeld-Quandt test would require the computation of two
new regressions.
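The procedure may be sketched as follows. This is an illustration rather than a definitive implementation: NumPy is assumed, the function names and simulated data are hypothetical, and one particular ordering is assumed, namely that the $m$ low observations are added to the central base first (working backward) and the $m$ high observations afterward (working forward), so that all $2m$ residuals belong to a single recursive sequence.

```python
import numpy as np

def ordered_recursive_residuals(X, y, base, added):
    """Recursive residuals from adding rows `added`, one at a time, to rows `base`."""
    idx = list(base)
    w = []
    for j in added:
        Xb, yb = X[idx], y[idx]
        S = np.linalg.inv(Xb.T @ Xb)
        b = S @ Xb.T @ yb
        w.append((y[j] - X[j] @ b) / np.sqrt(1.0 + X[j] @ S @ X[j]))
        idx.append(j)                       # the new point joins the base
    return np.array(w)

def heteroscedasticity_f(X, y, col, m):
    """F ratio of Eq. (10-8) after ordering the data by column `col`."""
    n = X.shape[0]
    order = np.argsort(X[:, col])
    Xs, ys = X[order], y[order]
    base = range(m, n - m)                  # central observations
    low = list(range(m - 1, -1, -1))        # first m points, added backward
    high = list(range(n - m, n))            # last m points, added forward
    w = ordered_recursive_residuals(Xs, ys, base, low + high)
    w1, w2 = w[:m], w[m:]
    return float(w2 @ w2) / float(w1 @ w1), w1, w2

# Simulated data in which the disturbance standard deviation rises with X.
rng = np.random.default_rng(3)
n, m = 30, 10
X = np.column_stack([np.ones(n), rng.uniform(1.0, 5.0, size=n)])
y = X @ np.array([2.0, 0.5]) + rng.normal(size=n) * X[:, 1]
F, w1, w2 = heteroscedasticity_f(X, y, col=1, m=m)
```

The computed `F` would then be referred to the $F(m, m)$ distribution.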

A third application of recursive residuals is in testing for autocorrelation.† In
a time-series application one may take the first $k$ observations as the base. From
the resultant $n - k$ recursive residuals the conventional von Neumann ratio is‡
$$\frac{\delta^2}{s^2} = \frac{\sum_{r=k+2}^{n}(w_r - w_{r-1})^2/(n - k - 1)}{\sum_{r=k+1}^{n}(w_r - \bar{w})^2/(n - k)}$$
where $\bar{w} = \sum_{r=k+1}^{n} w_r/(n - k)$. This is the ratio of the mean-square successive
difference to the variance. An exact test against serial correlation would be
provided by referring the calculated value of $\delta^2/s^2$ to the significance points of
the von Neumann ratio.§ These critical values, however, were derived for the
general case where the expected value of the series being tested is some unknown
constant. In this application the $w$'s are known to have zero mean. Incorporating
this information, Press and Brooks have computed significance points for a
modified von Neumann ratio¶
$$\left(\frac{\delta^2}{s^2}\right)' = \frac{\sum_{r=k+2}^{n}(w_r - w_{r-1})^2/(n - k - 1)}{\sum_{r=k+1}^{n} w_r^2/(n - k)}$$   (10-9)

† G. D. A. Phillips and A. C. Harvey, “A Simple Test for Serial Correlation in Regression
Analysis,” Journal of the American Statistical Association, vol. 69, 1974, pp. 935-939.

‡ J. von Neumann, “Distribution of the Ratio of the Mean Square Successive Difference to the
Variance,” Annals of Mathematical Statistics, vol. 12, 1941, pp. 367-395.

§ B. I. Hart, “Significance Levels for the Ratio of the Mean Square Successive Difference to the
Variance,” Annals of Mathematical Statistics, vol. 13, 1942, pp. 445-447.

¶ S. J. Press and R. B. Brooks, “Testing for Serial Correlation in Regression,” Report no. 6911,
Center for Mathematical Studies in Business and Economics, University of Chicago, Chicago, 1969.

These points are tabulated in App. B-7. The von Neumann ratio is arithmetically
closely related to the Durbin-Watson statistic, which could, of course, be computed
from the recursive residuals. The crucial point, however, is that the
multivariate normal distribution for $w$ specified in Eq. (10-3) satisfies the assumptions
underlying the derivation of the von Neumann (Press and Brooks) significance
points so that an exact test is available, thus avoiding the inconclusive zone
associated with the Durbin-Watson statistic calculated from the OLS residuals.
Some sampling experiments by Phillips and Harvey suggest that the power of this
test may be increased by forming the initial base from a mixture of the first and
last observations.
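Both ratios are simple functions of the recursive residuals. The sketch below, in plain Python with a hypothetical function name and made-up residuals, computes the conventional ratio (variance about the sample mean) and the modified ratio of Eq. (10-9) (variance about zero); the divisors follow the formulas given above.

```python
def von_neumann_ratios(w):
    """Return (conventional, modified) von Neumann ratios for residuals w."""
    N = len(w)                                   # N = n - k recursive residuals
    # mean-square successive difference, divisor n - k - 1
    msd = sum((w[i] - w[i - 1]) ** 2 for i in range(1, N)) / (N - 1)
    mean = sum(w) / N
    var_about_mean = sum((v - mean) ** 2 for v in w) / N
    var_about_zero = sum(v * v for v in w) / N   # uses the known zero mean
    return msd / var_about_mean, msd / var_about_zero

# Made-up residuals that alternate in sign (strong negative serial correlation).
conventional, modified = von_neumann_ratios([1.0, -1.0, 1.0, -1.0])
```

For this alternating series both ratios equal 4, the upper end of the range; the two ratios coincide here only because the sample mean happens to be zero.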

Fourth, recursive residuals provide a test of some possible forms of
misspecification.† Since, under the null hypothesis, the recursive residuals are
independently and identically distributed normal variables with zero expectation,
the mean of the residuals divided by its estimated standard error will follow a $t$
distribution. Formally
$$t = \frac{\bar{w}}{s_w/\sqrt{n - k}} \sim t(n - k - 1)$$   (10-10)
where
$$\bar{w} = \frac{1}{n - k}\sum_{r=k+1}^{n} w_r$$
and
$$s_w^2 = \frac{1}{n - k - 1}\sum_{r=k+1}^{n}(w_r - \bar{w})^2$$
As an illustration of the use of this test in specification analysis suppose the
postulated model is a linear relation between $Y$ and $X$. If the true relation is
convex (concave) and the data are ordered by the size of $X$, the recursive residuals
would be expected to be mainly positive (negative) and the computed $t$ statistic
will tend to be large in absolute value. In a multivariate situation this specification
test could still be carried out for any single explanatory variable, if it were thought
that the other explanatory variables were correctly specified, but this type of a
priori knowledge is seldom available. Several specification errors might have a
self-canceling effect on the recursive residuals, so this test is not likely to be very
effective in multivariate situations.
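The statistic in Eq. (10-10) is the ordinary one-sample $t$ ratio applied to the recursive residuals. A brief sketch in plain Python follows; the function name and the sample values are illustrative only.

```python
import math

def mean_t_statistic(w):
    """t ratio of Eq. (10-10): mean of w over its estimated standard error."""
    N = len(w)                                        # N = n - k
    mean = sum(w) / N
    s2 = sum((v - mean) ** 2 for v in w) / (N - 1)    # divisor n - k - 1
    return mean / math.sqrt(s2 / N), N - 1            # statistic, degrees of freedom

# Made-up residuals that are all positive, as a convex true relation might produce.
t_stat, df = mean_t_statistic([1.0, 2.0, 3.0])
```

With these values the mean is 2 and the estimated variance is 1, so the statistic is $2\sqrt{3}$ on 2 degrees of freedom.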

Finally Brown, Durbin, and Evans describe‡ an important application of
recursive residuals in testing for structural change over time. The null hypothesis
of no structural change for the model $y = X\beta + u$ is specified as
$$H_0: \quad \beta_1 = \beta_2 = \cdots = \beta_n = \beta$$
$$\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = \sigma^2$$
where $\beta_t$ denotes the vector of coefficients ruling in period $t$ and $\sigma_t^2$ the
disturbance variance in that period. It is clear that the null hypothesis would be
violated if the $\beta$ vectors remained constant but $\sigma_t^2$ varied. This would be the classic
case of heteroscedasticity, which might be tested by some of the procedures
already outlined. The main concern in problems of structural change, however, is
variation in the $\beta$'s.

† A. C. Harvey and P. Collier, “Testing for Functional Misspecification in Regression Analysis,”
Journal of Econometrics, vol. 6, 1977, pp. 103-119.

‡ R. L. Brown, J. Durbin, and J. M. Evans, “Techniques for Testing the Constancy of Regression
Relationships over Time,” Journal of the Royal Statistical Society, ser. B, vol. 37, 1975, pp. 149-192.

The authors suggest a pair of tests, namely, the cusum test and the cusum of
squares test. The first test statistic is the cusum quantity
$$W_r = \sum_{j=k+1}^{r} w_j/\hat{\sigma} \qquad r = k + 1, \ldots, n$$   (10-11)
where
$$\hat{\sigma}^2 = \frac{RSS_n}{n - k}$$
$W_r$ is seen to be a cumulative sum, and it should be plotted against $r$. As long as
the $\beta$ vectors are constant, $E(W_r) = 0$, but if the $\beta$'s change, $W_r$ will tend to
diverge from the zero mean value line. For a forward recursion the significance of
the departure of $W_r$ from the zero line may be assessed by reference to a pair of
straight lines which pass through the points
$$\{k, \pm a\sqrt{n - k}\} \quad \text{and} \quad \{n, \pm 3a\sqrt{n - k}\}$$
where $a$ is a parameter depending on the significance level $\alpha$ chosen for the test.
The correspondence for some conventional significance levels is

$\alpha = 0.01$    $a = 1.143$
$\alpha = 0.05$    $a = 0.948$
$\alpha = 0.10$    $a = 0.850$

The lines are shown in Fig. 10-1.

Figure 10-1 Cusum plot.

The equation of the upper line in Fig. 10-1 may be determined from
$$\frac{W - a\sqrt{n - k}}{r - k} = \frac{2a\sqrt{n - k}}{n - k}$$
or
$$W = a\sqrt{n - k} + \frac{2a(r - k)}{\sqrt{n - k}}$$
and the equation of the lower line is given by its negative.
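The cusum path and its significance lines can be computed as in the sketch below, written in plain Python. The function name is hypothetical, the default $a = 0.948$ corresponds to the 5 percent level quoted above, and $\hat{\sigma}^2$ is taken as the sum of the squared recursive residuals divided by $n - k$ (which equals $RSS_n/(n - k)$ by Eq. (10-7)); that scaling is an assumption of this sketch.

```python
import math

def cusum(w, k, a=0.948):
    """Cusum path W_r, the upper significance line, and a crossing flag."""
    N = len(w)                                      # N = n - k
    n = N + k
    sigma = math.sqrt(sum(v * v for v in w) / N)    # sigma-hat, assumed RSS_n/(n - k)
    W, total = [], 0.0
    for v in w:
        total += v / sigma
        W.append(total)                             # W_r for r = k+1, ..., n
    # upper line through {k, a*sqrt(n-k)} and {n, 3a*sqrt(n-k)}
    upper = [a * math.sqrt(N) + 2 * a * (r - k) / math.sqrt(N)
             for r in range(k + 1, n + 1)]
    crossed = any(abs(Wr) > bound for Wr, bound in zip(W, upper))
    return W, upper, crossed

# Made-up residuals with k = 2 regressors.
W, upper, crossed = cusum([1.0, -1.0, 1.0, -1.0], k=2)
```

Because the lower line is the negative of the upper one, comparing $|W_r|$ with the upper line is sufficient to detect a crossing of either.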
The second test statistic is based on cumulative sums of the squared residuals,
namely,
$$s_r = \frac{\sum_{j=k+1}^{r} w_j^2}{\sum_{j=k+1}^{n} w_j^2} \qquad r = k + 1, \ldots, n$$   (10-12)
The mean value line giving the expected value of the test statistic under the null
hypothesis is
$$E(s_r) = \frac{r - k}{n - k}$$
which goes from zero at $r = k$ to unity at $r = n$. The significance of the departure
of $s_r$ from its expected value may be assessed by reference to a pair of lines drawn
parallel to the $E(s_r)$ line at a distance $c_0$ above and below. Values of $c_0$ for
various sample sizes and levels of significance are tabulated in App. B-8. Reference
should be made to the Brown, Durbin, and Evans article for practical
illustrations of the technique and for interpretations of various plots. The basic
idea is that instability of the parameters would be indicated if the plot of $W_r$ or $s_r$
crossed the significance lines described above. There is some evidence that the
cusum test is less powerful than the cusum of squares test. Some Monte Carlo
experiments by Garbade also suggest that the latter may not be very powerful in
comparison with tests based on variable parameter models.† However, the
explanatory variable in his experiments was random over time, and it would be
interesting to see if the same result was obtained with an autoregressive
explanatory variable.
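The cusum of squares statistic and its mean value line are equally direct to compute. In the sketch below (plain Python, hypothetical function name, made-up residuals) the critical distance `c0` must be supplied by the user from the tabulated values; no value is built in.

```python
def cusum_of_squares(w, c0=None):
    """Return s_r, E(s_r), the largest departure, and a crossing flag (if c0 given)."""
    N = len(w)                                  # N = n - k
    total = sum(v * v for v in w)
    s, running = [], 0.0
    for v in w:
        running += v * v
        s.append(running / total)               # s_r for r = k+1, ..., n
    expected = [(j + 1) / N for j in range(N)]  # (r - k)/(n - k)
    max_dev = max(abs(a - b) for a, b in zip(s, expected))
    crossed = None if c0 is None else max_dev > c0
    return s, expected, max_dev, crossed

# Made-up residuals whose variance appears to rise halfway through the sample.
s, expected, max_dev, crossed = cusum_of_squares([1.0, 1.0, 2.0, 2.0])
```

In this example the path lies below the mean value line, with the largest departure (0.3) occurring at the second point.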


10-2 SPLINE FUNCTIONS 


In an interesting study Poirier and Garber examined the determinants of profit
rates in the aerospace industry over the period 1951-1971.‡ They were particularly
interested in the behavior of profit rates, ceteris paribus, in three distinct
periods, 1951-1954 (Korean war), 1954-1965 (peace), and 1965-1971 (Vietnam
war). To cover the ceteris paribus proviso, they included eleven explanatory
variables, apart from time, in their equation. They treated time by means of spline
functions, and to illustrate the basic idea we will assume that the profit rate has
been adjusted for the effects of the eleven variables and look at the behavior of
the net, or adjusted, profit rate over time. Assuming a linear time trend, the
postulated model would be
$$\begin{aligned} \text{Period 1} \qquad & y_t = \alpha_1 + \beta_1 t + u_t & t \le a \\ \text{Period 2} \qquad & y_t = \alpha_2 + \beta_2 t + u_t & a < t \le b \\ \text{Period 3} \qquad & y_t = \alpha_3 + \beta_3 t + u_t & b < t \end{aligned}$$   (10-13)
In this example we might take the origin of time to be 1950. Measuring in years
then gives $a = 4$ (i.e., 1954) and $b = 15$ (1965). The data might be split into three
distinct subsets and three separate time trends estimated. The result, in general,
would look like Fig. 10-2a. There is nothing in the unrestricted estimation process
to ensure that the functions meet at the join points $t = a$ and $t = b$. Fig. 10-2b
illustrates a linear spline, or piecewise linear, function, which eliminates
instantaneous jumps or discontinuities in the function at the join points or knots.

† K. Garbade, “Two Methods for Examining the Stability of Regression Coefficients,” Journal of
the American Statistical Association, vol. 72, 1977, pp. 54-63.

‡ See D. J. Poirier and S. G. Garber, “The Determinants of Aerospace Profit Rates, 1951-1971,”
Southern Economic Journal, vol. 41, 1974, pp. 228-238; or D. J. Poirier, The Econometrics of Structural
Change, North-Holland, Amsterdam, 1974, Chap. 2.
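The linear spline function may be fitted in two alternative fashions. One is to
define the following variables:
$$w_{1t} = t$$
$$w_{2t} = \begin{cases} 0 & \text{if } t \le a \\ t - a & \text{if } a < t \end{cases}$$
$$w_{3t} = \begin{cases} 0 & \text{if } t \le b \\ t - b & \text{if } b < t \end{cases}$$
and reparameterize the function as
$$y_t = \alpha_1 + \delta_1 w_{1t} + \delta_2 w_{2t} + \delta_3 w_{3t} + u_t$$   (10-14)
Comparing Eqs. (10-13) and (10-14) it is easy to see that
$$\begin{aligned} \beta_1 &= \delta_1 \\ \beta_2 &= \delta_1 + \delta_2 & \alpha_2 &= \alpha_1 - \delta_2 a \\ \beta_3 &= \delta_1 + \delta_2 + \delta_3 & \alpha_3 &= \alpha_2 - \delta_3 b \end{aligned}$$   (10-15)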
Figure 10-2 [panels (a) and (b)]

Fitting Eq. (10-14) directly by OLS will yield estimated functions which meet at
the knots, and the estimated $\alpha$ and $\beta$ parameters of those functions can be
determined from Eqs. (10-15). Tests on $\alpha$'s and $\beta$'s imply equivalent tests on the
$\delta$'s. Thus testing the significance of $\delta_1 (= \beta_1)$ is asking whether there is a positive
(or negative) trend in the first period. Testing the significance of $\delta_2$ is asking
whether the trend slope in the second period differs significantly from that in the
first, and similarly, testing the significance of $\delta_3$ amounts to asking whether the
trend slope in the third period differs from that in the second. Setting up the null
hypothesis
$$H_0: \begin{bmatrix} \delta_2 \\ \delta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
is equivalent to postulating that the $\beta$'s and the $\alpha$'s are the same in all three
periods, that is, that the data may be adequately described by a single linear
trend. This test may be carried out most simply by fitting
$$y_t = \alpha_1 + \delta_1 w_{1t} + u_t$$
as the restricted model, the full spline function (10-14) as the unrestricted model,
and calculating the test statistic defined in Eq. (6-8).
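The construction of the spline regressors and the mapping in Eqs. (10-15) can be illustrated with a short sketch in plain Python. The function names and the coefficient values are made up for illustration; the knots $a = 4$ and $b = 15$ are those of the example in the text.

```python
def spline_regressors(t, a, b):
    """Return (w1, w2, w3) of the linear spline for time t with knots a < b."""
    return t, max(t - a, 0), max(t - b, 0)

def segment_parameters(alpha1, d1, d2, d3, a, b):
    """Map the spline coefficients to segment intercepts and slopes, Eqs. (10-15)."""
    betas = (d1, d1 + d2, d1 + d2 + d3)
    alpha2 = alpha1 - d2 * a
    alpha3 = alpha2 - d3 * b
    return (alpha1, alpha2, alpha3), betas

a, b = 4, 15
alpha1, d1, d2, d3 = 10.0, 2.0, -3.0, 1.5          # made-up coefficient values
alphas, betas = segment_parameters(alpha1, d1, d2, d3, a, b)

def spline_value(t):
    """Systematic part of Eq. (10-14) at time t."""
    w1, w2, w3 = spline_regressors(t, a, b)
    return alpha1 + d1 * w1 + d2 * w2 + d3 * w3
```

With these values the adjacent segments take the same value at each knot (18 at $t = 4$ and 7 at $t = 15$), which is the continuity that the spline parameterization builds in.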

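An alternative estimation procedure is restricted least squares. Returning to
Eqs. (10-13), the restrictions implied by the join points are
$$\alpha_1 + \beta_1 a = \alpha_2 + \beta_2 a$$
$$\alpha_2 + \beta_2 b = \alpha_3 + \beta_3 b$$
which may be set up in the conventional framework $R\beta = r$ as
$$\begin{bmatrix} 1 & a & -1 & -a & 0 & 0 \\ 0 & 0 & 1 & b & -1 & -b \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \beta_1 \\ \alpha_2 \\ \beta_2 \\ \alpha_3 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$   (10-16)
Thus the model
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 & 1 & & & & \\ 1 & 2 & & & & \\ \vdots & \vdots & & & & \\ 1 & a & & & & \\ & & 1 & a+1 & & \\ & & 1 & a+2 & & \\ & & \vdots & \vdots & & \\ & & 1 & b & & \\ & & & & 1 & b+1 \\ & & & & 1 & b+2 \\ & & & & \vdots & \vdots \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \beta_1 \\ \alpha_2 \\ \beta_2 \\ \alpha_3 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix}$$   (10-17)
where the empty cells in the data matrix are all zero, is fitted subject to the
restrictions in Eq. (10-16). The appropriate formula is given in Eq. (6-5). The
estimates of the $\alpha$ and $\beta$ parameters will be identical to those derived from the
estimated coefficients of the spline function in Eq. (10-14).†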
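The equivalence of the two estimation routes can be checked numerically. The sketch below assumes NumPy and simulated data with knots $a = 4$, $b = 15$ and 21 observations; the restricted estimator is computed from the standard restricted least-squares formula $b^* = b + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - Rb)$, which is taken here to be what the restricted-estimator equation referred to above states. All variable names are illustrative.

```python
import numpy as np

a, b, T = 4, 15, 21
t = np.arange(1, T + 1, dtype=float)
rng = np.random.default_rng(1)
# Simulated series that follows a continuous piecewise linear trend plus noise.
y = (5 + 0.8 * t - 1.1 * np.maximum(t - a, 0) + 0.6 * np.maximum(t - b, 0)
     + rng.normal(scale=0.3, size=T))

# Block-diagonal data matrix of Eq. (10-17): columns (1, t) for each period.
X = np.zeros((T, 6))
for j, mask in enumerate([t <= a, (t > a) & (t <= b), t > b]):
    X[mask, 2 * j] = 1.0
    X[mask, 2 * j + 1] = t[mask]

# Continuity restrictions of Eq. (10-16).
R = np.array([[1, a, -1, -a, 0, 0],
              [0, 0, 1, b, -1, -b]], dtype=float)
r = np.zeros(2)

XtX_inv = np.linalg.inv(X.T @ X)
b_ols = XtX_inv @ X.T @ y                       # three separate trend lines
b_rls = b_ols + XtX_inv @ R.T @ np.linalg.solve(R @ XtX_inv @ R.T, r - R @ b_ols)

# Spline regression (10-14), mapped to segment parameters via Eqs. (10-15).
W = np.column_stack([np.ones(T), t, np.maximum(t - a, 0), np.maximum(t - b, 0)])
alpha1, d1, d2, d3 = np.linalg.lstsq(W, y, rcond=None)[0]
from_spline = np.array([alpha1, d1,
                        alpha1 - d2 * a, d1 + d2,
                        alpha1 - d2 * a - d3 * b, d1 + d2 + d3])
```

The vectors `b_rls` and `from_spline` should agree up to rounding error, while `b_ols` will in general violate the continuity restrictions.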

This simplified example used time as an explanatory variable. The procedure
works equally well for any explanatory variable $x$ with known join points, or
knots, at $x_a$, $x_b$, and so on. A possible disadvantage of the linear spline is that
while the function itself is continuous at the knots, there is a discontinuity or
jump in the first derivative. This may be overcome by the introduction of
quadratic or cubic splines. To illustrate a cubic spline function, suppose we have a
two-variable relation with known knots at $x_a$ and $x_b$. Within each subset $y$ is
expressed as a third-degree polynomial in $x$, namely,
$$y = \alpha_i + \beta_{i1}x + \beta_{i2}x^2 + \beta_{i3}x^3 + u \qquad i = 1, 2, 3$$   (10-18)
where the subsets are defined by
$$\begin{aligned} i = 1 \qquad & x \le x_a \\ i = 2 \qquad & x_a < x \le x_b \\ i = 3 \qquad & x_b < x \end{aligned}$$
The restrictions implied by continuity at the knots are then
$$\alpha_1 + \beta_{11}x_a + \beta_{12}x_a^2 + \beta_{13}x_a^3 = \alpha_2 + \beta_{21}x_a + \beta_{22}x_a^2 + \beta_{23}x_a^3$$
$$\alpha_2 + \beta_{21}x_b + \beta_{22}x_b^2 + \beta_{23}x_b^3 = \alpha_3 + \beta_{31}x_b + \beta_{32}x_b^2 + \beta_{33}x_b^3$$
We further impose continuity of the first derivatives of the cubic spline function,
which implies
$$\beta_{11} + 2\beta_{12}x_a + 3\beta_{13}x_a^2 = \beta_{21} + 2\beta_{22}x_a + 3\beta_{23}x_a^2$$
$$\beta_{21} + 2\beta_{22}x_b + 3\beta_{23}x_b^2 = \beta_{31} + 2\beta_{32}x_b + 3\beta_{33}x_b^2$$
In addition, continuity of the second derivatives implies
$$2\beta_{12} + 6\beta_{13}x_a = 2\beta_{22} + 6\beta_{23}x_a$$
$$2\beta_{22} + 6\beta_{23}x_b = 2\beta_{32} + 6\beta_{33}x_b$$
The cubic spline thus only allows discontinuities in the third derivatives at the join
points. Thus the cubic spline may be estimated by fitting Eq. (10-18) and
estimating the twelve parameters subject to the six restrictions set out above.‡

† See Problem 10-4.

‡ See A. Buse and L. Lim, “Cubic Splines as a Special Case of Restricted Least Squares,” Journal
of the American Statistical Association, vol. 72, 1977, pp. 64-72, which develops the restricted least
squares approach; and D. J. Poirier, op. cit., Chap. 3, for an alternative estimation procedure.

Both previous examples have been in terms of spline functions on a single
explanatory variable. There are now several examples of applications of bilinear
splines, where linear splines are specified for two variables with main effects and
interaction effects at a two-dimensional grid of specified knots.†


10-3 POOLING OF TIME-SERIES AND CROSS-SECTION DATA 


In many problems the investigator may have access to observations on the
behavior of a “panel” of decision units at a number of different (and usually
successive) time periods. We will assume there are $p$ distinct decision units or
groups indexed by $i = 1, \ldots, p$ and $m$ successive time periods indexed by
$t = 1, \ldots, m$, giving a total of $n = pm$ sample points. The variables are denoted by

$Y_{it}$ = value of the dependent variable for unit $i$ in period $t$    $i = 1, \ldots, p$; $t = 1, \ldots, m$
$X_{jit}$ = value of $j$th explanatory variable for unit $i$ in period $t$    $j = 2, \ldots, k$

The linear hypothesis would then be
$$Y_{it} = \alpha + \beta_2 X_{2it} + \beta_3 X_{3it} + \cdots + \beta_k X_{kit} + u_{it}$$   (10-19)


where, for the moment, we assume a common set of parameters for all units in all 
time periods. To illustrate some of the many possible applications of the model 
consider some examples. 


1. The panel consists of, say, 1000 households whose savings behavior $Y_{it}$ is
monitored along with various explanatory variables $X_{jit}$, such as income,
family size, and composition, over a number of time periods.

2. The panel consists of a set of firms, and the object of study is the size and
timing of their investment expenditures $Y_{it}$ as a function of the group of
explanatory variables thought to influence investment.

3. The panel might consist of the 50 states of the United States, and the focus of
investigation is the determinants of the unemployment rate $Y_{it}$ across states
and over time.

4. The panel consists of the OECD countries, and $Y_{it}$ indicates the per capita
consumption of gasoline in country $i$ in year $t$. The relevant question is
whether the usual economic variables such as income and relative prices can
adequately explain the variation in $Y_{it}$.


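The most common way of organizing the data in Eq. (10-19) is by decision
units. Thus let
$$y_i = \begin{bmatrix} Y_{i1} \\ \vdots \\ Y_{im} \end{bmatrix} \qquad X_i = \begin{bmatrix} X_{2i1} & \cdots & X_{ki1} \\ \vdots & & \vdots \\ X_{2im} & \cdots & X_{kim} \end{bmatrix} \qquad u_i = \begin{bmatrix} u_{i1} \\ \vdots \\ u_{im} \end{bmatrix}$$
denote the data and the disturbances relevant to the $i$th unit. The data may be
“stacked” to form
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_p \end{bmatrix} \qquad X = \begin{bmatrix} X_1 \\ \vdots \\ X_p \end{bmatrix} \qquad u = \begin{bmatrix} u_1 \\ \vdots \\ u_p \end{bmatrix}$$   (10-20)
where $y$ is $n \times 1$, $X$ is $n \times (k - 1)$, and $u$ is $n \times 1$. The model in Eq. (10-19) may
be expressed as
$$y = [i \;\; X]\begin{bmatrix} \alpha \\ \beta \end{bmatrix} + u$$   (10-21)
where $i$ is an $n \times 1$ vector of units, $\alpha$ is a scalar, and $\beta = (\beta_2 \;\; \beta_3 \;\cdots\; \beta_k)'$.

† See D. J. Poirier, The Econometrics of Structural Change, North-Holland, Amsterdam, 1974,
Chap. 4.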
A variety of models has been proposed for time-series and cross-section data, 
and most have been fitted to some data set or another. These models may all be 
derived from Eq. (10-21) by varying the assumptions made about the systematic 
part of the equation and/or the assumptions made about the disturbance vector. 
A possible taxonomy of models is indicated in Table 10-1. The meaning of 
various terms in the table may not be clear at first sight but will become so as the 
models are explained. 

Model I(a) is perfectly straightforward. The systematic part of Eq. (10-21)
postulates a common intercept and a common set of slope coefficients for all units
at all time periods. The disturbance assumption is
$$u_{it} \sim \text{iid}(0, \sigma^2) \qquad \text{for all } i, t$$
where iid means independently and identically distributed. Thus there is no serial
correlation in the disturbances for any individual unit, there is no dependence,
either contemporaneous or lagged, between the disturbances for different units,
and the disturbance has a constant variance at all points. The appropriate
estimation method is OLS applied to the stacked data of Eqs. (10-20). If, in
addition, the $u_{it}$ are assumed to be normally distributed, all the finite sample
inference procedures of Chaps. 5 and 6 are valid.

Table 10-1 Taxonomy of time-series, cross-section models

                              Assumptions about
          ------------------------------------------------------------------------
                                 Vector of slope
          Intercept              coefficients           Disturbance term
Model     $\alpha$               $\beta$                $u$
I(a)      Common for all i, t    Common for all i, t    $E(uu') = \sigma^2 I$
I(b)      Common for all i, t    Common for all i, t    $E(uu') = V$
II(a)     Varying over i         Common for all i, t    Fixed effects model
II(b)     Varying over i         Common for all i, t    Random effects model
III(a)    Varying over i, t      Common for all i, t    Fixed effects model
III(b)    Varying over i, t      Common for all i, t    Random effects model
IV        Varying over i         Varying over i         $E(uu') = \sigma^2 I$ or $E(uu') = V$

Model I(b) allows a richer specification for the disturbance term. There are,
in fact, several versions of model I(b) depending upon the precise assumptions
made about var($u$). Suppose, for instance, one postulates
$$E(u_{it}^2) = \sigma_{ii} \qquad \text{for all } t;\ i = 1, \ldots, p$$
$$E(u_{it}u_{jt}) = \sigma_{ij} \qquad \text{for all } t \text{ and } i \ne j$$
$$E(u_{it}u_{js}) = 0 \qquad \text{for all } i, j, \text{ and } t \ne s$$
These assumptions allow for heteroscedasticity of the disturbance term across
units and for nonzero contemporaneous covariances between the disturbances in
different units but rule out lagged correlations within and between disturbances.
The resultant variance matrix is
$$E(uu') = V = \begin{bmatrix} \sigma_{11}I_m & \sigma_{12}I_m & \cdots & \sigma_{1p}I_m \\ \sigma_{21}I_m & \sigma_{22}I_m & \cdots & \sigma_{2p}I_m \\ \vdots & \vdots & & \vdots \\ \sigma_{p1}I_m & \sigma_{p2}I_m & \cdots & \sigma_{pp}I_m \end{bmatrix}$$   (10-22)
The application of GLS to Eq. (10-21) using Eq. (10-22) would now yield the
b.l.u.e. of $\beta$,
$$b_* = (X'V^{-1}X)^{-1}X'V^{-1}y$$
The $\sigma_{ij}$, however, are unknown. They may be estimated by the following
procedure.

Fit Eq. (10-21) by OLS and partition the residual vector into the subvectors $e_i$
($i = 1, \ldots, p$) relating to decision units. Then calculate
$$s_{ij} = \frac{e_i'e_j}{m}$$
Substitution of the $s_{ij}$ in Eq. (10-22) gives a $V$ matrix which may be used to
compute the feasible GLS estimator. The usual inference procedures now apply
asymptotically. Another version of model I(b) could be produced by adding an
assumption of autocorrelated disturbances within each decision unit.†
† See Problem 10-5 and also J. Kmenta, Elements of Econometrics, Macmillan, New York, 1971,
pp. 512-514, for a discussion of this case.

Model II relaxes the assumption of a common intercept but retains the
assumption of a common vector of slope coefficients for all decision units. The
matrix formulation of this model is then
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} i_m & 0 & \cdots & 0 & X_1 \\ 0 & i_m & \cdots & 0 & X_2 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & i_m & X_p \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \\ \beta \end{bmatrix} + u$$   (10-23)
or
$$y = Z\alpha + X\beta + u$$   (10-24)
where the definition of $Z$ is obvious from the comparison of Eqs. (10-23) and
(10-24). Define the matrix $B$ as
$$B = Z(Z'Z)^{-1}Z'$$
It is easily seen that $B$ is an $n \times n$ matrix given by
$$B = \frac{1}{m}\begin{bmatrix} J_m & 0 & \cdots & 0 \\ 0 & J_m & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & J_m \end{bmatrix}$$
where
$$J_m = i_m i_m'$$
is an $m \times m$ matrix consisting entirely of ones. From the definition of $J_m$,
$$By = \begin{bmatrix} \bar{Y}_1 i_m \\ \bar{Y}_2 i_m \\ \vdots \\ \bar{Y}_p i_m \end{bmatrix}$$
where
$$\bar{Y}_i = \frac{1}{m}\sum_{t=1}^{m} Y_{it} \qquad i = 1, \ldots, p$$
Thus premultiplication of any $n \times 1$ vector by $B$ will replace each observation for
any decision unit by the sample mean of that variable for the decision unit. If we
then define
$$P = I_n - B$$
premultiplication by $P$ will replace the original observations by the deviations
from their unit sample means. It is also clear that $P$ is a symmetric idempotent
matrix, which is orthogonal to $Z$, that is,
$$PZ = 0$$
Premultiplication of Eq. (10-24) by $P$ then gives
$$Py = (PX)\beta + Pu$$   (10-25)


Thus estimation of the $\beta$ vector may be achieved by applying OLS to the data
expressed in terms of deviations from group (unit) means. The resultant estimator
is
$$b = (X'PX)^{-1}X'Py$$   (10-26)
This is, of course, exactly the same vector as results from the application of OLS
to Eq. (10-24). The normal equations are
$$Z'Za + Z'Xb = Z'y$$
$$X'Za + X'Xb = X'y$$
Solving the first equation for $a$,
$$a = (Z'Z)^{-1}(Z'y - Z'Xb)$$   (10-27)
Substituting in the second normal equation and solving for $b$ gives, after some
manipulation,
$$b = (X'PX)^{-1}X'Py$$
as before. From the definition of $Z$ it may be seen that $(Z'Z)^{-1}$ is a $p \times p$
diagonal matrix, namely,
$$(Z'Z)^{-1} = \operatorname{diag}\{m^{-1} \;\; m^{-1} \;\cdots\; m^{-1}\}$$
and premultiplying an $n \times 1$ vector by $Z'$ serves to sum elements within each
group. Thus Eq. (10-27) implies
$$a_i = \bar{Y}_i - b_2\bar{X}_{2i} - \cdots - b_k\bar{X}_{ki} \qquad i = 1, \ldots, p$$   (10-28)
This model, which is designated as model II(a), is usually known as the fixed
effects model. The fixed effects are the intercepts $\alpha_i$, one for each group. It is
usually assumed that the $u$ vector in Eq. (10-24) is homoscedastic and
nonautocorrelated so that OLS provides b.l.u.e.'s, though GLS estimators could be
constructed on the lines of model I(b). The $b$ vector of Eq. (10-26) is also
sometimes referred to as the “within” estimator, since it is based on the
within-group deviations $(Y_{it} - \bar{Y}_i)$ and $(X_{jit} - \bar{X}_{ji})$. Equations (10-21) and (10-24)
have already appeared in Sec. 6-2 on Tests of Structural Change. The exposition
in that section was solely in terms of a time-series application where the “groups”
referred to $p$ different subperiods, not necessarily all of the same length. Equation
(10-21) is a restricted version of Eq. (10-24), and tests of the restrictions may be
made in the context of OLS estimates as in Sec. 6-2, or in the context of GLS
estimates as in Sec. 8-6.
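The equivalence between the “within” estimator and OLS with a separate intercept for each unit can be verified numerically. The sketch below assumes NumPy and a small simulated panel ($p = 4$ units, $m = 6$ periods, two slope variables); all names and parameter values are illustrative.

```python
import numpy as np

p, m = 4, 6                                      # units and time periods
rng = np.random.default_rng(2)
X = rng.normal(size=(p * m, 2))                  # stacked explanatory variables
Z = np.kron(np.eye(p), np.ones((m, 1)))          # unit dummies, as in Eq. (10-23)
alpha = np.array([1.0, -2.0, 0.5, 3.0])          # made-up unit intercepts
y = Z @ alpha + X @ np.array([1.5, -0.7]) + rng.normal(scale=0.2, size=p * m)

B = Z @ np.linalg.inv(Z.T @ Z) @ Z.T             # replaces data by unit means
P = np.eye(p * m) - B                            # deviations from unit means

b_within = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)             # Eq. (10-26)
a_within = np.linalg.solve(Z.T @ Z, Z.T @ (y - X @ b_within))    # Eq. (10-27)

# Direct OLS on Eq. (10-24): p intercepts followed by the slope coefficients.
coef = np.linalg.lstsq(np.hstack([Z, X]), y, rcond=None)[0]
```

The last two elements of `coef` should reproduce `b_within`, and the first $p$ elements should reproduce `a_within`.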

Model II(b) is the random effects, or error component, model. Instead of
assuming a set of given (unknown) constants $\alpha_1, \ldots, \alpha_p$ for the $p$ groups, a single
intercept $\alpha$ is postulated, and the differential intercepts are merged with the
disturbance term. The model is now formulated as in Eq. (10-21), namely,
$$y = [i \;\; X]\begin{bmatrix} \alpha \\ \beta \end{bmatrix} + u$$
but the assumptions about $u$ are
$$u_{it} = \alpha_i + \varepsilon_{it}$$
where the $\alpha_i$ are drawn at random from $N(0, \sigma_\alpha^2)$ and the $\varepsilon_{it}$ are drawn at random
from $N(0, \sigma_\varepsilon^2)$. The $\alpha_i$ are now increments (positive or negative) to the common
intercept $\alpha$. To derive the variance matrix of $u$ we note that for the $i$th group we
may write

u 


i al, + 7 
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It then follows directly that 


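$$E(u_iu_i') = \sigma_\alpha^2\,\mathbf{i}_m\mathbf{i}_m' + \sigma_\varepsilon^2 I_m \qquad i = 1, \ldots, p$$

$$= \begin{bmatrix} \sigma_\alpha^2 + \sigma_\varepsilon^2 & \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 \\ \sigma_\alpha^2 & \sigma_\alpha^2 + \sigma_\varepsilon^2 & \cdots & \sigma_\alpha^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_\alpha^2 & \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 + \sigma_\varepsilon^2 \end{bmatrix} = \sigma_u^2 \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix} = \sigma_u^2 A$$

where

$$\sigma_u^2 = \sigma_\alpha^2 + \sigma_\varepsilon^2 \qquad \text{and} \qquad \rho = \frac{\sigma_\alpha^2}{\sigma_u^2}$$

Since $E(u_iu_j') = 0$ for $i \ne j$,

$$V = E(uu') = \sigma_u^2 \begin{bmatrix} A & 0 & \cdots & 0 \\ 0 & A & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A \end{bmatrix} = \sigma_u^2\,(I_p \otimes A)$$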


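The matrix A may also be expressed as

$$A = (1 - \rho)I_m + \rho J_m$$

where $J_m = \mathbf{i}_m\mathbf{i}_m'$ is the m × m matrix of ones. This facilitates finding the inverse. Let

$$A^{-1} = \lambda_1 J_m + \lambda_2 I_m \tag{10-29}$$

where $\lambda_1$ and $\lambda_2$ are constants to be determined. Multiplying out and noting that
$J_m^2 = mJ_m$,

$$AA^{-1} = (1 - \rho)\lambda_2 I_m + [(1 - \rho)\lambda_1 + m\rho\lambda_1 + \rho\lambda_2]J_m$$

Equating the right-hand side to $I_m$ gives

$$\lambda_1 = \frac{-\rho}{(1 - \rho)(1 - \rho + m\rho)} \qquad \lambda_2 = \frac{1}{1 - \rho} \tag{10-30}$$

which, on substitution in Eq. (10-29), gives $A^{-1}$. The GLS estimator of model
II(b) might then be obtained from Eq. (10-21) using

$$V^{-1} = \frac{1}{\sigma_u^2} \begin{bmatrix} A^{-1} & 0 & \cdots & 0 \\ 0 & A^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A^{-1} \end{bmatrix} \tag{10-31}$$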
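
The closed form for the inverse of the equicorrelation matrix A is easy to verify numerically. The sketch below is only an illustration: the function names and the values m = 5, ρ = 0.4 are arbitrary assumptions, and the check simply multiplies A by the proposed inverse built from the two constants of Eq. (10-29).

```python
import numpy as np

def A_matrix(m, rho):
    """A = (1 - rho) I_m + rho J_m, with J_m the m x m matrix of ones."""
    return (1 - rho) * np.eye(m) + rho * np.ones((m, m))

def A_inverse(m, rho):
    """Closed-form inverse lambda_1 J_m + lambda_2 I_m."""
    lam1 = -rho / ((1 - rho) * (1 - rho + m * rho))
    lam2 = 1 / (1 - rho)
    return lam1 * np.ones((m, m)) + lam2 * np.eye(m)

m, rho = 5, 0.4                       # illustrative values only
check = A_matrix(m, rho) @ A_inverse(m, rho)   # should be the identity matrix
```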


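The difficulty, however, is that $V^{-1}$ involves the unknown $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$. Before
dealing with this problem it may be shown that the GLS estimates can also be
achieved by applying OLS to suitably transformed variables. From Eq. (10-29) we
may write

$$A^{-1} = \begin{bmatrix} \lambda_1 + \lambda_2 & \lambda_1 & \cdots & \lambda_1 \\ \lambda_1 & \lambda_1 + \lambda_2 & \cdots & \lambda_1 \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_1 & \lambda_1 & \cdots & \lambda_1 + \lambda_2 \end{bmatrix}$$

Let $z = [z_1 \;\; z_2 \;\; \cdots \;\; z_m]'$ denote a column vector of m observations on some
variable.† It then follows that

$$\begin{aligned}
z'A^{-1}z &= \lambda_1\Bigl(\sum z\Bigr)^2 + \lambda_2\sum z^2 \\
&= \frac{1}{1 - \rho}\left[\sum z^2 - \frac{\rho}{1 - \rho + m\rho}\Bigl(\sum z\Bigr)^2\right] \qquad \text{using Eq. (10-30)} \\
&= \frac{1}{1 - \rho}\left[\sum z^2 - \frac{m^2\rho}{1 - \rho + m\rho}\,\bar{z}^2\right]
\end{aligned}$$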
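The form of this expression suggests defining a quasi deviation as

$$z_t^* = z_t - c\bar{z}$$

and asking whether a constant c can be found such that

$$\sum z^{*2} = \sum z^2 - \frac{m^2\rho}{1 - \rho + m\rho}\,\bar{z}^2$$

Now

$$\sum z^{*2} = \sum z^2 - (2cm - c^2m)\bar{z}^2$$

Solving for c gives

$$c = 1 \pm \sqrt{\frac{1 - \rho}{1 - \rho + m\rho}} = 1 \pm \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + m\sigma_\alpha^2}} \tag{10-32}$$

recalling that $\rho = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$. It is customary to take the negative sign in Eq.
(10-32), and thus we have

$$z'A^{-1}z = \frac{1}{1 - \rho}\sum_{t=1}^{m}(z_t - c\bar{z})^2 \tag{10-33}$$

where‡

$$c = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + m\sigma_\alpha^2}} \tag{10-34}$$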


† z can thus represent the sample observations on the dependent or an independent variable for
any given unit.

‡ This result is stated without proof in J. A. Hausman, “Specification Tests in Econometrics,”
Econometrica, vol. 46, 1978, p. 1262.

The estimation of Eq. (10-21) by GLS using the $V^{-1}$ defined in Eq. (10-31) then
gives

$$\begin{bmatrix} a \\ b \end{bmatrix} = \bigl([\,\mathbf{i} \;\; X\,]'V^{-1}[\,\mathbf{i} \;\; X\,]\bigr)^{-1}[\,\mathbf{i} \;\; X\,]'V^{-1}y \tag{10-35}$$

From the definition of $V^{-1}$ in Eq. (10-31) the elements in the matrix and vector
on the right-hand side of Eq. (10-35) are of the form

$$z_i'A^{-1}z_j$$

where $z_i$ and $z_j$ represent m × 1 vectors. Thus the GLS estimator defined in Eq.
(10-35) is equivalent to applying OLS to the quasi deviations

$$Y_{it}^* = Y_{it} - c\bar{Y}_i$$

$$X_{jit}^* = X_{jit} - c\bar{X}_{ji} \qquad i = 1, \ldots, p;\ \ t = 1, \ldots, m;\ \ j = 2, \ldots, k$$
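
The claim that GLS with the block-diagonal V coincides with OLS on quasi deviations can be illustrated numerically. The sketch below assumes the two variance components are known and the panel is balanced; the data and the function names are hypothetical. Note that the column of ones is transformed along with the other columns, so it becomes the constant 1 - c.

```python
import numpy as np

def c_factor(m, s2_alpha, s2_eps):
    """Eq. (10-34): c = 1 - sqrt(s2_eps / (s2_eps + m * s2_alpha))."""
    return 1 - np.sqrt(s2_eps / (s2_eps + m * s2_alpha))

def gls_direct(y, W, p, m, s2_alpha, s2_eps):
    """GLS with V = I_p kron (s2_alpha * J_m + s2_eps * I_m)."""
    block = s2_alpha * np.ones((m, m)) + s2_eps * np.eye(m)
    V_inv = np.kron(np.eye(p), np.linalg.inv(block))
    return np.linalg.solve(W.T @ V_inv @ W, W.T @ V_inv @ y)

def ols_quasi_deviations(y, W, p, m, s2_alpha, s2_eps):
    """OLS after subtracting c times the group mean from every column."""
    c = c_factor(m, s2_alpha, s2_eps)

    def quasi(v):
        g = v.reshape(p, m, -1)
        return (g - c * g.mean(axis=1, keepdims=True)).reshape(p * m, -1)

    return np.linalg.lstsq(quasi(W), quasi(y).ravel(), rcond=None)[0]

# Hypothetical balanced panel: p = 5 units, m = 4 periods, W = [i X].
rng = np.random.default_rng(1)
p, m = 5, 4
X = rng.normal(size=(p * m, 2))
W = np.column_stack([np.ones(p * m), X])
y = (W @ np.array([1.0, 0.5, -1.0])
     + np.repeat(rng.normal(size=p), m)      # unit effects alpha_i
     + rng.normal(size=p * m))               # epsilon_it

b_gls = gls_direct(y, W, p, m, s2_alpha=1.0, s2_eps=1.0)
b_quasi = ols_quasi_deviations(y, W, p, m, s2_alpha=1.0, s2_eps=1.0)
```

The two coefficient vectors should agree to rounding error, since the scale factor 1/(1 - ρ) in Eq. (10-33) cancels in the estimator.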


Either procedure requires an estimate of the variances appearing in $V^{-1}$ or in c.
These may be developed as follows. The disturbance in this model is

$$u_{it} = \alpha_i + \varepsilon_{it}$$

Averaging over t for unit i gives

$$\bar{u}_i = \alpha_i + \bar{\varepsilon}_i$$

Averaging then over i gives

$$\bar{u} = \bar{\alpha} + \bar{\varepsilon}$$

and the usual decomposition of sums of squares gives

$$\sum_{i,t}(u_{it} - \bar{u})^2 = \sum_{i,t}(u_{it} - \bar{u}_i)^2 + \sum_{i,t}(\bar{u}_i - \bar{u})^2$$


The resulting analysis of variance is shown in Table 10-2. 
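Looking first of all at the within-group mean square,

$$\sum_{i,t}(u_{it} - \bar{u}_i)^2 = \sum_{i,t}(\varepsilon_{it} - \bar{\varepsilon}_i)^2$$

Recall that if $x_1, x_2, \ldots, x_m$ are drawn at random from $x \sim N(0, \sigma^2)$,†

$$E\Bigl\{\sum_{t=1}^{m}(x_t - \bar{x})^2\Bigr\} = (m - 1)\sigma^2$$

Thus

$$E\Bigl\{\sum_{t=1}^{m}(\varepsilon_{it} - \bar{\varepsilon}_i)^2\Bigr\} = (m - 1)\sigma_\varepsilon^2$$

and

$$E\Bigl\{\sum_{i=1}^{p}\sum_{t=1}^{m}(\varepsilon_{it} - \bar{\varepsilon}_i)^2\Bigr\} = p(m - 1)\sigma_\varepsilon^2$$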

† The assumption of normality is not required for this result, only that $x \sim \text{iid}(0, \sigma^2)$.

Table 10-2 ANOVA of the disturbance term

Source           Sum of squares                        Degrees of freedom   Mean square                                              Expected mean square
Between groups   $\sum_{i,t}(\bar{u}_i - \bar{u})^2$   $p - 1$              $\frac{1}{p-1}\sum_{i,t}(\bar{u}_i - \bar{u})^2$         $\sigma_\varepsilon^2 + m\sigma_\alpha^2$
Within groups    $\sum_{i,t}(u_{it} - \bar{u}_i)^2$    $p(m - 1)$           $\frac{1}{p(m-1)}\sum_{i,t}(u_{it} - \bar{u}_i)^2$       $\sigma_\varepsilon^2$
Total            $\sum_{i,t}(u_{it} - \bar{u})^2$      $pm - 1$


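giving $\sigma_\varepsilon^2$ as the expected within-group mean square. Similarly,

$$\sum_{i,t}(\bar{u}_i - \bar{u})^2 = \sum_{i,t}(\alpha_i - \bar{\alpha})^2 + \sum_{i,t}(\bar{\varepsilon}_i - \bar{\varepsilon})^2 + 2\sum_{i,t}(\alpha_i - \bar{\alpha})(\bar{\varepsilon}_i - \bar{\varepsilon})$$

But

$$E\Bigl\{\sum_{i=1}^{p}(\alpha_i - \bar{\alpha})^2\Bigr\} = (p - 1)\sigma_\alpha^2$$

and

$$E\Bigl\{\sum_{i=1}^{p}(\bar{\varepsilon}_i - \bar{\varepsilon})^2\Bigr\} = (p - 1)\frac{\sigma_\varepsilon^2}{m}$$

since the $\bar{\varepsilon}_i$ are drawn at random from $N(0, \sigma_\varepsilon^2/m)$. Thus

$$E\Bigl\{\sum_{i,t}(\bar{u}_i - \bar{u})^2\Bigr\} = m(p - 1)\sigma_\alpha^2 + (p - 1)\sigma_\varepsilon^2$$

and the expected between-group mean square is $m\sigma_\alpha^2 + \sigma_\varepsilon^2$. The $u_{it}$ in Table 10-2
are, of course, unobserved, but we can estimate the relevant disturbance variances
by substituting estimated u’s in these formulas.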
The estimation procedures may be summarized as follows. 


OLS on transformed data 


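1. Fit the basic model, Eq. (10-21), by OLS and obtain the n × 1 vector $\hat{u}$ of
OLS residuals. Compute also the mean residual $\bar{\hat{u}}_i$ for each unit, and note
that $\bar{\hat{u}} = 0$.

2. Compute the estimate of c in Eq. (10-34) from the within-group and between-group
mean squares of the residuals, as in Table 10-2:

$$\hat{c} = 1 - \left[\frac{\sum_{i,t}(\hat{u}_{it} - \bar{\hat{u}}_i)^2 / p(m - 1)}{m\sum_{i}\bar{\hat{u}}_i^2 / (p - 1)}\right]^{1/2}$$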

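3. Compute the quasi deviations $Y_{it}^* = Y_{it} - \hat{c}\bar{Y}_i$, and so on, and apply OLS.

The direct application of the GLS formula requires estimates of $\rho = \sigma_\alpha^2/\sigma_u^2$ and of
$\sigma_u^2$. These are also obtained from the OLS residuals. The steps are as follows.

1. As in OLS procedure.

2. Compute

$$s_\varepsilon^2 = \frac{1}{p(m - 1)}\sum_{i=1}^{p}\sum_{t=1}^{m}(\hat{u}_{it} - \bar{\hat{u}}_i)^2$$

$$s_\alpha^2 = \frac{1}{m}\left\{\frac{m}{p - 1}\sum_{i=1}^{p}(\bar{\hat{u}}_i - \bar{\hat{u}})^2 - s_\varepsilon^2\right\}$$

$$s_u^2 = s_\alpha^2 + s_\varepsilon^2$$

$$\hat{\rho} = \frac{s_\alpha^2}{s_u^2}$$

3. Using $\hat{\rho}$ and $s_u^2$, compute $\hat{V}$ and the GLS estimator defined in Eq. (10-35).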
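
The variance-component step of either procedure amounts to a one-way analysis of variance of the OLS residuals. A minimal sketch follows, assuming a balanced panel with residuals stacked unit by unit; the function names are hypothetical, and truncating a negative estimate of the between-unit variance at zero is an added assumption that the text does not discuss.

```python
import numpy as np

def variance_components(resid, p, m):
    """Estimate (s2_eps, s2_alpha) from residuals using the mean squares of Table 10-2."""
    u = resid.reshape(p, m)
    u_bar_i = u.mean(axis=1)                                   # unit mean residuals
    within_ms = ((u - u_bar_i[:, None]) ** 2).sum() / (p * (m - 1))
    between_ms = m * ((u_bar_i - u.mean()) ** 2).sum() / (p - 1)
    s2_eps = within_ms                                         # estimates sigma_eps^2
    s2_alpha = (between_ms - within_ms) / m                    # can be negative in small samples
    return s2_eps, s2_alpha

def feasible_c(resid, p, m):
    """Plug the estimated components into Eq. (10-34)."""
    s2_eps, s2_alpha = variance_components(resid, p, m)
    s2_alpha = max(s2_alpha, 0.0)      # assumption: truncate at zero (not from the text)
    return 1 - np.sqrt(s2_eps / (s2_eps + m * s2_alpha))

# Tiny worked example: p = 2 units, m = 2 periods.
toy = np.array([1.0, -1.0, 3.0, 1.0])
s2_eps, s2_alpha = variance_components(toy, p=2, m=2)
c_hat = feasible_c(toy, p=2, m=2)
```

For the toy vector the unit means are 0 and 2, the within-group mean square is 2, and the between-group mean square is 4, so the components are 2 and 1 and the estimated c is 1 - sqrt(1/2).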


The estimation of the variance component from the OLS residuals is not to be
recommended when lagged values of Y appear in the X matrix. Since
$\rho = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$ is constrained to be in the (0, 1) interval, a grid search over this interval
for the ML estimator is a feasible procedure.†

The final question with respect to model II is the choice between fitting either 
the fixed effects or the random effects model. The choice basically has to be made 
by the researcher based on the institutional realities relevant to the problem being 
studied. Returning to the examples given at the beginning of this section, suppose 
certain monetary/fiscal policies are set in place in an attempt to reduce
unemployment rates across the country, and after some time an analysis is made of the
experience of the various states. As a result of historical developments, the states 
have variable mixtures of industrial, commercial, private, and public structures. 
One would thus expect differential effects across states, which would be modeled 
appropriately by the fixed effects assumption. On the other hand, if we look at the 
per capita consumption of gasoline in the OECD countries, we will certainly 
observe very different levels of the dependent variable in different countries. 
However, it is also true that for tax and other reasons the real price of gasoline 
has historically been very different in different countries. For sound economic 
reasons this may be expected to have long-run effects on the size of automobiles
and on per capita gasoline consumption. Inserting dummy variables to allow 
different intercepts across countries removes this variation from the data, and the 
“effects” of the explanatory variables are estimated solely from the within 


† See P. Balestra and M. Nerlove, “Pooling Cross-Section and Time Series Data in the Estimation
of a Dynamic Model: The Demand for Natural Gas,” Econometrica, vol. 34, 1966, pp. 585-612; and
G. S. Maddala, “The Use of Variance Components Models in Pooling Cross-Section and Time-Series
Data,” Econometrica, vol. 39, 1971, pp. 341-358.

estimator, Eq. (10-26), which is based on the within-country variation and is not
influenced by the between-country variation. Thus a fixed effects model would be 
liable to underestimate the price elasticity. The random effects model would be 
equally inappropriate since it would attribute significant variations in consump- 
tion to unidentified stochastic factors rather than to price. In this case a more 
sensible estimator of long-run price and income effects would be obtained by 
computing the between estimator based on country (group) means. Averaging Eq. 
(10-19) over groups gives 


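$$\bar{Y}_i = \alpha + \beta_2\bar{X}_{2i} + \cdots + \beta_k\bar{X}_{ki} + \bar{u}_i \qquad i = 1, \ldots, p$$

This is the same as transforming the original data by premultiplication by the B
matrix defined earlier and computing the OLS estimator†

$$\begin{bmatrix} a \\ b \end{bmatrix} = \bigl([\,\mathbf{i} \;\; X\,]'B[\,\mathbf{i} \;\; X\,]\bigr)^{-1}[\,\mathbf{i} \;\; X\,]'By \tag{10-36}$$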
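
The between estimator can be computed either from the p group means or from the full data premultiplied by B. The sketch below assumes a balanced panel and takes B to be the matrix that replaces each observation by its group mean, that is, B = I_p ⊗ (1/m)J_m; that form, the function names, and the data are assumptions made for illustration.

```python
import numpy as np

def between_from_means(y, X, p, m):
    """OLS of the group means of y on a constant and the group means of X."""
    y_bar = y.reshape(p, m).mean(axis=1)
    X_bar = X.reshape(p, m, -1).mean(axis=1)
    W_bar = np.hstack([np.ones((p, 1)), X_bar])
    return np.linalg.lstsq(W_bar, y_bar, rcond=None)[0]

def between_via_B(y, X, p, m):
    """Same estimator written as in Eq. (10-36), with an assumed form for B."""
    B = np.kron(np.eye(p), np.ones((m, m)) / m)     # assumed: group-averaging matrix
    W = np.hstack([np.ones((p * m, 1)), X])         # [i X]
    return np.linalg.solve(W.T @ B @ W, W.T @ B @ y)

# Hypothetical data: p = 6 units, m = 5 periods, two regressors.
rng = np.random.default_rng(2)
p, m = 6, 5
X = rng.normal(size=(p * m, 2)) + np.repeat(rng.normal(size=(p, 2)), m, axis=0)
y = 1.0 + X @ np.array([0.8, -0.4]) + rng.normal(size=p * m)

est_means = between_from_means(y, X, p, m)
est_B = between_via_B(y, X, p, m)
```

Because B is symmetric and idempotent, the two computations give identical coefficients.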


The random effects model would seem appropriate when the decision units (say, 
households or firms) have been drawn from some population of such units. 
Conditional on the explanatory variables, there will be an average level of 
response in the population, and individual levels will vary around that average as 
a consequence of unidentified stochastic factors. 

Looking at the statistic defined in Eq. (10-34), which produces the quasi 
deviations underlying the GLS (random effects) model, we see 


1. As $m \to \infty$, $c \to 1$, and the GLS (random effects) estimator of β tends to the
fixed effects estimator of β.

2. As $\sigma_\alpha^2$ becomes very large relative to $\sigma_\varepsilon^2$, $c \to 1$, and again the random effects
and fixed effects estimators of β will tend to coincide.

3. As $\sigma_\alpha^2 \to 0$, $c \to 0$, and the random effects estimator would tend to the OLS
estimator $(X'X)^{-1}X'y$.
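
These three limiting cases can be seen directly by evaluating Eq. (10-34) at extreme values. The snippet is a small illustration only; the function name and the particular numbers are arbitrary assumptions.

```python
import math

def c_factor(m, s2_alpha, s2_eps):
    """Eq. (10-34): c = 1 - sqrt(s2_eps / (s2_eps + m * s2_alpha))."""
    return 1 - math.sqrt(s2_eps / (s2_eps + m * s2_alpha))

c_large_m = c_factor(10**8, 1.0, 1.0)       # many observations per unit: c near 1
c_large_alpha = c_factor(5, 1e8, 1.0)       # dominant unit effects: c near 1
c_no_alpha = c_factor(5, 0.0, 1.0)          # no unit effects: c equals 0 (plain OLS)
```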


† For a lucid and practical discussion of these issues see J. M. Griffin, Energy Conservation in the
OECD: 1980 to 2000, Ballinger, Mass., 1979, Chap. 2.

Returning to the taxonomy in Table 10-1, model III allows the intercept to
vary over units and time periods, while retaining the assumption of a common β
vector for all i, t. This again may be estimated by a fixed effects or random effects
approach. The former extends Eq. (10-24) to include dummy variables for the
time periods, taking care to use only m − 1 such dummies in order to avoid a
singular data matrix. The random effects model postulates the disturbance to be

$$u_{it} = \alpha_i + \gamma_t + \varepsilon_{it}$$

where the γ’s are assigned at random to the time periods from some postulated
distribution. Just as $\alpha_i$ is assumed common to the ith unit for all time periods, so
is $\gamma_t$ assumed common to all units in the tth time period. The extensions from
model II are relatively straightforward, and we will not go into them here.†

Model IV allows both the intercept and the β vector, or some components of
it, to vary across units. This model has already been studied in Sec. 6-2 under the
simplest possible assumptions about the disturbance term and in Sec. 8-6 in the
context of the SURE model. The random effects version of model IV might also
be extended to allow for time-specific as well as unit-specific error components. In
testing for the stability of the β vector (whether across units or over time) it is then
especially important to use the procedures of Sec. 8-6 with an appropriately
specified variance matrix for the disturbance.‡


10-4 VARIABLE-PARAMETER MODELS 


This topic has already appeared in several places. Sec. 6-2 on structural change 
investigated variations in some or all of the parameters of a relation, but it was 
known a priori at which point possible structural breaks might have occurred 
(peacetime, wartime, and so on). Section 10-2 on spline functions showed how 
different functions might be fitted so as to meet at the known join points. Section 
10-3 on time-series and cross-section data considered many possible variations in 
parameters, but again, as in Sec. 6-2, there were obvious points at which such 
changes might be expected. Only in Sec. 10-1 on recursive residuals was there 
some discussion of the case where the β vector might change at unknown points.
We must now consider cases where there is no a priori information on the 
observational points at which structural changes might have taken place, and in 
this brief section we will consider just two possible approaches. The approach of 
switching regressions is based on the assumption that there is a known (small) 
number of different regimes, but the switching points are unknown. The other 
approach is based on the assumption of continuous parameter variation. 


Switching Regimes 


The simplest case of switching regimes is based on the assumption of just two 
different regimes. The switch may depend on time or on a “threshold” value for
some variable, or it may be triggered stochastically. For instance, wage and price
decisions may be different in periods of low inflation and in periods of high
inflation. The pioneering treatment of switching regimes is due to Quandt.§ To


† Reference may be made to the articles by Maddala and by Balestra and Nerlove already cited,
and also to T. D. Wallace and A. Hussain, “The Use of Error Component Models in Combining
Cross-Section with Time-Series Data,” Econometrica, vol. 37, 1969, pp. 55-72; and Y. Mundlak, “On
the Pooling of Time-Series and Cross-Section Data,” Econometrica, vol. 46, 1978, pp. 69-86.

‡ See Problem 10-7 and B. H. Baltagi, “An Experimental Study of Alternative Testing and
Estimation Procedures in a Two-Way Error Component Model,” Journal of Econometrics, vol. 17,
1981, pp. 21-49.

§ See S. M. Goldfeld and R. Quandt, Studies in Nonlinear Estimation, Ballinger, Mass., 1976,
Chap. 1, and references therein.

illustrate the approach suppose we have $t = 1, \ldots, n$ sample observations and the
hypothesis is that

Regime 1: $y_t = \alpha_1 + \beta_1x_t + u_{1t}$ holds for $t \le t^*$
Regime 2: $y_t = \alpha_2 + \beta_2x_t + u_{2t}$ holds for $t > t^*$


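where $t^*$ is unknown. Assuming the u’s to be normally and independently
distributed with zero means and variances $\sigma_1^2$ and $\sigma_2^2$, the log likelihood is

$$\ln L = -\frac{n}{2}\ln 2\pi - \frac{t^*}{2}\ln\sigma_1^2 - \frac{n - t^*}{2}\ln\sigma_2^2 - \frac{1}{2\sigma_1^2}\sum_{t=1}^{t^*}(y_t - \alpha_1 - \beta_1x_t)^2 - \frac{1}{2\sigma_2^2}\sum_{t=t^*+1}^{n}(y_t - \alpha_2 - \beta_2x_t)^2 \tag{10-37}$$

ML estimates of $\alpha_i$, $\beta_i$, and $\sigma_i^2$ (i = 1, 2) would be given by two separate OLS
regressions for any assumed value of $t^*$. On replacing these parameters by their
ML estimates the last two terms in Eq. (10-37) become

$$-\frac{t^*\hat{\sigma}_1^2}{2\hat{\sigma}_1^2} - \frac{(n - t^*)\hat{\sigma}_2^2}{2\hat{\sigma}_2^2} = -\frac{n}{2}$$

and so

$$\ln L = -\frac{n}{2}\ln 2\pi - \frac{n}{2} - \frac{t^*}{2}\ln\hat{\sigma}_1^2(t^*) - \frac{n - t^*}{2}\ln\hat{\sigma}_2^2(t^*) \tag{10-38}$$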


An estimate of the switch point $t^*$ could then be made by evaluating Eq. (10-38)
for all possible values of $t^*$ and choosing the one that maximizes the likelihood.
With n sample observations and two variables the possible range for $t^*$ is from
$t^* = 3$ to $t^* = n - 3$, implying the calculation of n − 5 pairs of regressions.
Riddell, however, has recently pointed out that the computational burden is
considerably reduced by making use of recursive residuals.† Consider the set of
forward recursive residuals $w_3, w_4, \ldots$. From Eq. (10-7) we have

$$RSS_t = RSS_{t-1} + w_t^2$$

Thus

$$RSS_{t^*} = \sum_{t=3}^{t^*} w_t^2$$

and

$$\hat{\sigma}_1^2(t^*) = \frac{RSS_{t^*}}{t^*}$$

In a similar fashion $\hat{\sigma}_2^2(t^*)$ can be constructed from the set of backward recursive
residuals. Thus just two passes of a recursive residuals program will generate all 
the data required to find the maximum of Eq. (10-38). 
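
The grid search over the switch point can be sketched directly. The code below is a minimal illustration that runs the n − 5 pairs of OLS regressions by brute force rather than using the recursive-residual shortcut; the function names and the simulated two-regime data are assumptions made for the example.

```python
import numpy as np

def ols_sigma2(y, x):
    """ML variance estimate (RSS / number of observations) from regressing y on [1, x]."""
    W = np.column_stack([np.ones_like(x), x])
    resid = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    return resid @ resid / len(y)

def switch_point_loglik(y, x):
    """Concentrated log likelihood, Eq. (10-38), for t* = 3, ..., n - 3."""
    n = len(y)
    out = {}
    for t_star in range(3, n - 2):                 # first t* observations form regime 1
        s1 = ols_sigma2(y[:t_star], x[:t_star])
        s2 = ols_sigma2(y[t_star:], x[t_star:])
        out[t_star] = (-0.5 * n * np.log(2 * np.pi) - 0.5 * n
                       - 0.5 * t_star * np.log(s1)
                       - 0.5 * (n - t_star) * np.log(s2))
    return out

# Hypothetical data with a switch after observation 30.
rng = np.random.default_rng(0)
n, true_break = 60, 30
x = rng.uniform(0.0, 1.0, size=n)
y = np.where(np.arange(n) < true_break, 1.0 + 2.0 * x, 5.0 - 1.0 * x)
y = y + 0.05 * rng.normal(size=n)

loglik = switch_point_loglik(y, x)
t_hat = max(loglik, key=loglik.get)                # value of t* maximizing Eq. (10-38)
```

Plotting the values in `loglik` against the candidate switch points gives the likelihood profile from which the estimate is read off.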


† W. C. Riddell, “Estimating Switching Regressions: A Computational Note,” Journal of Statistical
Computation and Simulation, vol. 10, 1980, pp. 95-101.

The null hypothesis that no switch occurred may be examined by means of
the likelihood ratio statistic.† Let

$$\lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})}$$

where $L(\hat{\Omega})$ is the unrestricted maximum of the likelihood function over the
entire parameter space. In this example it is the antilogarithm of the maximum of
Eq. (10-38) since it is assumed to be known that there is at most one switch point
and the restriction of a single regression (no switch point) has not been imposed.
$L(\hat{\omega})$ is the maximum of the likelihood function over the subspace $\omega \subset \Omega$ to
which one is restricted by the hypothesis. In this problem it is the maximum value
of the likelihood for a single regression. Under the hypothesis of no switch

$$\ln L(\hat{\omega}) = -\frac{n}{2}\ln 2\pi - \frac{n}{2} - \frac{n}{2}\ln\hat{\sigma}^2$$

where

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{\alpha} - \hat{\beta}x_t)^2$$

$\hat{\alpha}$ and $\hat{\beta}$ being the OLS coefficients. Thus

$$\lambda = \frac{(\hat{\sigma}_1^2)^{t^*/2}\,(\hat{\sigma}_2^2)^{(n - t^*)/2}}{(\hat{\sigma}^2)^{n/2}}$$

The conditions required for $-2\ln\lambda$ to follow an approximate $\chi^2$ distribution are
not fulfilled since the likelihood function is only defined for integral values of $t^*$.
However, the graph of λ (or ln λ) against t can be instructive, as shown in Brown,
Durbin, and Evans, especially when considered in conjunction with other tests.‡
For a discussion of procedures when the switch is triggered in various other 
deterministic or stochastic fashions the reader should consult Goldfeld and 
Quandt.§ A special case of switching regimes arises in the context of
disequilibrium models where in some periods we have observations on the demand
function and in others on the supply function.¶

† For a brief account of likelihood ratio tests see P. G. Hoel, Introduction to Mathematical
Statistics, 4th edition, Wiley, New York, 1971, pp. 211-217.

‡ R. L. Brown, J. Durbin, and J. M. Evans, “Techniques for Testing the Constancy of Regression
Relationships over Time,” Journal of the Royal Statistical Society, ser. B, vol. 37, 1975, pp. 149-192,
especially p. 161.

§ S. M. Goldfeld and R. Quandt, Studies in Nonlinear Estimation, Ballinger, Mass., 1976.

¶ A treatment of disequilibrium models is beyond the scope of this book. Some important
references are R. C. Fair and D. M. Jaffee, “Methods of Estimation for Markets in Disequilibrium,”
Econometrica, vol. 40, 1972, pp. 497-514; R. C. Fair and H. H. Kelejian, “Methods of Estimation for
Markets in Disequilibrium: A Further Study,” Econometrica, vol. 42, 1974, pp. 177-190; T. Amemiya,
“A Note on a Fair and Jaffee Model,” Econometrica, vol. 42, 1974, pp. 759-762; G. S. Maddala and
F. D. Nelson, “Maximum Likelihood Methods for Models of Markets in Disequilibrium,”
Econometrica, vol. 42, 1974, pp. 1013-1030; S. M. Goldfeld and R. E. Quandt, “Estimation in a
Disequilibrium Model and the Value of Information,” Journal of Econometrics, vol. 3, 1975, pp.
325-348.


Continuous Parameter Variation


The capacity of econometric theorists to “invent” new varieties of models with 
continuous parameter variation tends to exceed the willingness and sometimes 
even the computational ability of researchers to apply them to real-world situa- 
tions. We will illustrate two main approaches, namely, random coefficient models 
and adaptive regression models, and in each case emphasize one or two major 
publications without attempting to give a comprehensive coverage of all recent 
theoretical developments. 


Random coefficient models. The traditional single-equation model y = Xβ + u
puts the ignorance or uncertainty into the disturbance vector u, while the β vector
is assumed to be fixed at all sample points. An alternative assumption is to make
the β vector stochastic and write the model as

$$y_j = (\beta_1 + v_{1j}) + (\beta_2 + v_{2j})X_{2j} + \cdots + (\beta_k + v_{kj})X_{kj} \qquad j = 1, \ldots, n \tag{10-39}$$

The β’s in Eq. (10-39) are unknown constants common to all sample points. The
$v_{ij}$ are stochastic variables which determine the coefficient vector for the jth
sample point. The n sample points might, for example, be a cross section of
households where important explanatory variables may be unobserved, and their
influence affects slope coefficients as well as the disturbance term. The reaction of
mortgage debt to, say, the measured rate of interest may well depend on the
unobserved age of the head of household. There is no need to insert the usual
equation disturbance term in Eq. (10-39) since it will merge with $v_{1j}$. Equation
(10-39) may be rewritten as

$$y_j = x_j'\beta + u_j \qquad j = 1, \ldots, n \tag{10-40}$$

where

$$u_j = x_j'v_j \qquad x_j' = [1 \;\; X_{2j} \;\; \cdots \;\; X_{kj}] \qquad v_j' = [v_{1j} \;\; v_{2j} \;\; \cdots \;\; v_{kj}]$$


Assumptions about the $v_j$ are required to make the model operational. A simple
set of assumptions is

$$E(v_j) = 0 \qquad j = 1, \ldots, n$$

$$E(v_jv_j') = \begin{bmatrix} \alpha_1 & 0 & \cdots & 0 \\ 0 & \alpha_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \alpha_k \end{bmatrix} = \Delta \qquad j = 1, \ldots, n \tag{10-41}$$

$$E(v_iv_j') = 0 \qquad i \ne j;\ \ i, j = 1, \ldots, n$$

The stochastic elements in the coefficients are thus assumed to have zero means
and to be uncorrelated between sample points and also between different
coefficients for any given sample point. The last assumption is possibly the least
plausible. If, for example, age has an effect on the reaction of mortgage debt to
the rate of interest, it may have a related effect on the response to income. These,
however, are the assumptions of the original Hildreth-Houck random coefficient
model.†

From Eqs. (10-41) the disturbances in Eq. (10-40) have the following properties:

$$E(u_j) = 0 \qquad j = 1, \ldots, n$$

$$E(u_j^2) = E(x_j'v_jv_j'x_j) = x_j'\Delta x_j \qquad j = 1, \ldots, n$$

$$E(u_iu_j) = 0 \qquad i \ne j;\ \ i, j = 1, \ldots, n$$

Since Δ is diagonal, var($u_j$) simplifies to

$$\sigma_j^2 = \dot{x}_j'\alpha \tag{10-42}$$

where

$$\dot{x}_j' = [1 \;\; X_{2j}^2 \;\; \cdots \;\; X_{kj}^2] \qquad \text{and} \qquad \alpha' = [\alpha_1 \;\; \alpha_2 \;\; \cdots \;\; \alpha_k]$$

Thus Eq. (10-40) constitutes a model with a heteroscedastic disturbance term, the
variance at each sample point being the same linear combination of the squares of
the explanatory variables at that point. Collecting the n variances in Eq. (10-42)
gives

$$\dot{\sigma} = \dot{X}\alpha$$

where $\dot{\sigma} = [\sigma_1^2 \;\; \cdots \;\; \sigma_n^2]'$ and $\dot{X}$ denotes the matrix obtained from X by squaring each element. The form
of this relation suggests that if estimates of the left-hand vector could be obtained,
a regression on $\dot{X}$ could yield an estimate of α. Looking at the residuals obtained
from the OLS fit to Eq. (10-40), e = y − Xb, we know from Chap. 5 that

$$E(ee') = ME(uu')M \tag{10-43}$$

where $M = I - X(X'X)^{-1}X'$


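† C. Hildreth and J. P. Houck, “Some Estimates for a Linear Model with Random Coefficients,”
Journal of the American Statistical Association, vol. 63, 1968, pp. 584-595.

It follows from Eq. (10-43) that

$$E(\dot{e}) = \begin{bmatrix} Ee_1^2 \\ \vdots \\ Ee_n^2 \end{bmatrix} = \dot{M}\dot{\sigma}$$

where $\dot{e}$ and $\dot{M}$ are obtained from e and M by squaring each element and $\dot{\sigma}$ is the
vector of the n variances $\sigma_j^2$. Thus

$$E(\dot{e}) = \dot{M}\dot{X}\alpha \tag{10-44}$$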


Equation (10-44) leads to the following procedure for constructing a feasible GLS 
estimator. 


1. Fit OLS to Eq. (10-40) and square each residual to obtain the vector $\dot{e}$.

2. Regress $\dot{e}$ on $\dot{M}\dot{X}$, which can be constructed from the original data matrix X,
to obtain an estimated vector $\hat{\alpha}$.

3. Substitute $\hat{\alpha}$ in Eq. (10-42) to obtain estimates $s_j^2$ of the variances of the u’s.

4. Using the $s_j^2$ obtain the GLS estimate of β in Eq. (10-40).
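
The four steps can be sketched in a few lines. This is a minimal illustration under the assumptions of Eq. (10-41), with negative variance estimates set to zero as Hildreth and Houck suggest; the function name `hildreth_houck` and the simulated data are hypothetical.

```python
import numpy as np

def hildreth_houck(y, X):
    """Feasible GLS for the random coefficient model, following steps 1 to 4."""
    n, k = X.shape
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    e_sq = (M @ y) ** 2                             # step 1: squared OLS residuals
    G = (M ** 2) @ (X ** 2)                         # element-wise squares of M and X
    alpha_hat = np.linalg.lstsq(G, e_sq, rcond=None)[0]   # step 2
    alpha_hat = np.maximum(alpha_hat, 0.0)          # negative estimates set to zero
    s2 = (X ** 2) @ alpha_hat                       # step 3: estimated var(u_j)
    w = 1.0 / s2                                    # GLS weights
    b_gls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # step 4
    return alpha_hat, s2, b_gls

# Hypothetical data: coefficient variances alpha = (1.0, 0.5), beta = (1, 2).
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
v = rng.normal(size=(n, 2)) * np.sqrt([1.0, 0.5])   # random parts of the coefficients
y = X @ np.array([1.0, 2.0]) + (X * v).sum(axis=1)  # u_j = x_j' v_j

alpha_hat, s2, b_gls = hildreth_houck(y, X)
```

The regressor matrix in step 2 relies on the fact that the diagonal of M D M, for diagonal D, equals the element-wise square of M times the diagonal of D.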




As Hildreth and Houck pointed out, there is no constraint on step 2 of this
process that ensures that the $\hat{\alpha}$’s are all nonnegative. They suggest setting any
negative $\hat{\alpha}$’s to zero. They also suggest a number of other methods of obtaining
consistent estimators of β, but it is difficult to know how to choose between them,
and their small sample properties are unknown. A small sample test of the
significance of the α vector (that is, whether the v’s have nonzero variances) might
be based on the OLS regression of $\dot{e}$ on $\dot{M}\dot{X}$, but the precise significance levels are
unknown.

The Hildreth-Houck model is applicable to a sample where there is just one 
observation per unit. The Swamy model is designed for cross-section time-series
data.† The data for the ith unit or group are modeled by

$$y_i = X_i(\beta + v_i) + u_i \qquad i = 1, \ldots, p \tag{10-45}$$

There are p separate units with m sample observations on each. The $X_i$ are all of
order m × k and rank k. The β vector of k coefficients is common to all units. The
$v_i$ vectors model the stochastic variation of the coefficient vector across units. For
all i, j = 1, ..., p it is assumed that

1. $Eu_i = 0$ and $E(u_iu_j') = \sigma_{ii}I_m$ if $i = j$, $0$ if $i \ne j$
2. $Ev_i = 0$
3. $E(v_iv_j') = \Delta$ if $i = j$, $0$ if $i \ne j$                                   (10-46)
4. $v_i$ and $u_j$ are independent

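† P. A. V. B. Swamy, “Efficient Inference in a Random Coefficient Regression Model,”
Econometrica, vol. 38, 1970, pp. 311-323.

The model would be written in full as

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}\beta + \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_p \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_p \end{bmatrix} \tag{10-47}$$

The disturbance vector is the sum of the last two terms in this expression. Using
the assumptions in Eqs. (10-46), the variance matrix for the composite
disturbance term is

$$V = \begin{bmatrix} X_1\Delta X_1' + \sigma_{11}I_m & 0 & \cdots & 0 \\ 0 & X_2\Delta X_2' + \sigma_{22}I_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_p\Delta X_p' + \sigma_{pp}I_m \end{bmatrix} \tag{10-48}$$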
A feasible GLS estimator of β could be constructed by first obtaining estimates of
Δ and the $\sigma_{ii}$ in V. These estimates are obtained as follows.

1. Compute the OLS vectors for each unit separately, that is,

$$b_i = (X_i'X_i)^{-1}X_i'y_i$$

and the vectors of OLS residuals $e_i = y_i - X_ib_i$.

2. An unbiased estimator of $\sigma_{ii}$ is given by

$$s_{ii} = \frac{e_i'e_i}{m - k}$$

3. An unbiased estimator of Δ is given by

$$\hat{\Delta} = \frac{S_b}{p - 1} - \frac{1}{p}\sum_{i=1}^{p}s_{ii}(X_i'X_i)^{-1}$$

where

$$S_b = \sum_{i=1}^{p}b_ib_i' - \frac{1}{p}\sum_{i=1}^{p}b_i\sum_{i=1}^{p}b_i'$$

4. Substitution in Eq. (10-48) gives $\hat{V}$ which may then be used to derive the
feasible GLS estimator of β in Eq. (10-47) as

$$b_* = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}y$$

where y and X denote the stacked vector and matrix in Eq. (10-47). The
estimated variance matrix is

$$\operatorname{var}(b_*) = (X'\hat{V}^{-1}X)^{-1}$$

and the conventional tests on $b_*$ would be valid asymptotically.
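
The four steps translate into a short routine. The sketch below is an illustration under the assumptions of Eqs. (10-46); because V is block diagonal, the stacked products are accumulated unit by unit instead of forming the full pm × pm matrix. The function name `swamy_fgls` and the simulated data are hypothetical, and no adjustment is made if the estimated Δ fails to be positive semidefinite.

```python
import numpy as np

def swamy_fgls(ys, Xs):
    """ys, Xs: lists of (m,) and (m, k) arrays, one pair per unit."""
    p = len(ys)
    m, k = Xs[0].shape
    bs, s_ii, XtX_inv = [], [], []
    for y_i, X_i in zip(ys, Xs):                      # steps 1 and 2
        inv_i = np.linalg.inv(X_i.T @ X_i)
        b_i = inv_i @ X_i.T @ y_i
        e_i = y_i - X_i @ b_i
        bs.append(b_i)
        s_ii.append(e_i @ e_i / (m - k))
        XtX_inv.append(inv_i)
    bs = np.array(bs)                                 # p x k matrix of unit estimates
    b_sum = bs.sum(axis=0)
    S_b = bs.T @ bs - np.outer(b_sum, b_sum) / p
    Delta_hat = S_b / (p - 1) - sum(s * inv for s, inv in zip(s_ii, XtX_inv)) / p  # step 3
    XVX = np.zeros((k, k))
    XVy = np.zeros(k)
    for y_i, X_i, s in zip(ys, Xs, s_ii):             # step 4, block by block
        V_i_inv = np.linalg.inv(X_i @ Delta_hat @ X_i.T + s * np.eye(m))
        XVX += X_i.T @ V_i_inv @ X_i
        XVy += X_i.T @ V_i_inv @ y_i
    b_star = np.linalg.solve(XVX, XVy)
    return b_star, np.linalg.inv(XVX), Delta_hat

# Hypothetical data: p = 40 units, m = 15 observations, beta = (1, 2).
rng = np.random.default_rng(0)
p, m = 40, 15
beta = np.array([1.0, 2.0])
Xs = [np.column_stack([np.ones(m), rng.normal(size=m)]) for _ in range(p)]
ys = [X_i @ (beta + 0.5 * rng.normal(size=2)) + 0.5 * rng.normal(size=m) for X_i in Xs]

b_star, var_b_star, Delta_hat = swamy_fgls(ys, Xs)
```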


Before fitting Eq. (10-47) by the above procedure it is desirable to test
whether the coefficient vectors are truly different across units. Letting $\beta_i = \beta + v_i$
denote the k × 1 vector of coefficients for the ith unit, we set up the null
hypothesis

$$H_0\colon\ \beta_1 = \beta_2 = \cdots = \beta_p = \beta$$

This hypothesis may be tested by computing the test statistic defined in Eq. (8-91)
for the SURE model, where the Σ in that formula is the variance matrix of the u’s
defined in Eq. (10-46), line 1. The R matrix would be set up by reformulating the
null hypothesis as

$$\beta_1 = \beta_2 \qquad \beta_1 = \beta_3 \qquad \cdots \qquad \beta_1 = \beta_p$$

However, as shown in Sec. 6-1, the same test statistic can be derived from the 
residual sums of squares from the restricted and unrestricted versions of the 
model. Under the null hypothesis the restricted model is 


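$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}\beta + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_p \end{bmatrix}$$

and the unrestricted model is

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_p \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_p \end{bmatrix}$$

The assumptions about the $u_i$ in Eq. (10-46), line 1, give

$$E(uu') = \begin{bmatrix} \sigma_{11}I_m & 0 & \cdots & 0 \\ 0 & \sigma_{22}I_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp}I_m \end{bmatrix}$$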

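Thus GLS estimates of each model are achieved by applying OLS to transformed
variables, where the transformation is to divide the observations for the ith unit
by $\sqrt{\sigma_{ii}}$ (i = 1, ..., p). The residual sum of squares from the restricted model is
then

$$e_*'e_* = \sum\frac{1}{\sigma_{ii}}y_i'y_i - \sum\frac{1}{\sigma_{ii}}y_i'X_ib$$

where

$$b = \left[\sum\frac{1}{\sigma_{ii}}X_i'X_i\right]^{-1}\sum\frac{1}{\sigma_{ii}}X_i'y_i \tag{10-49}$$

and all summations are over i = 1, ..., p. The residual sum of squares from the
unrestricted model is

$$e'e = \sum\frac{1}{\sigma_{ii}}y_i'y_i - \sum\frac{1}{\sigma_{ii}}y_i'X_ib_i \tag{10-50}$$

where

$$b_i = (X_i'X_i)^{-1}X_i'y_i \tag{10-51}$$

Thus

$$\begin{aligned}
e_*'e_* - e'e &= \sum\frac{1}{\sigma_{ii}}y_i'X_ib_i - \sum\frac{1}{\sigma_{ii}}y_i'X_ib \\
&= \sum\frac{1}{\sigma_{ii}}(b_i - b)'X_i'y_i \\
&= \sum\frac{1}{\sigma_{ii}}(b_i - b)'X_i'X_ib_i
\end{aligned}$$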

Finally it may be shown that†

$$e_*'e_* - e'e = \sum_{i=1}^{p}\frac{1}{\sigma_{ii}}(b_i - b)'X_i'X_i(b_i - b) \tag{10-52}$$

where b and $b_i$ are defined in Eqs. (10-49) and (10-51). If, in addition to the
assumptions already made, the u’s are normally distributed, then under the null
hypothesis,

$$F = \frac{(e_*'e_* - e'e)/k(p - 1)}{e'e/p(m - k)} \sim F[k(p - 1),\ p(m - k)]$$

This development has, however, used the unknown $\sigma_{ii}$. Replacing them by the
estimated values $s_{ii}$, the same test statistic can be computed, but it will now just
have asymptotic validity. This model has been extended to include lagged
variables and more complicated assumptions about the $v_i$ vectors.‡
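
The identity in Eq. (10-52) and the resulting F statistic can be checked numerically. The sketch below treats the unit variances as given, as the derivation does; the function name `pooling_test`, the variance values, and the data are assumptions made purely for illustration.

```python
import numpy as np

def pooling_test(ys, Xs, sigmas):
    """F statistic for H0: beta_1 = ... = beta_p, taking the sigma_ii as known."""
    p = len(ys)
    m, k = Xs[0].shape
    A = sum(X.T @ X / s for X, s in zip(Xs, sigmas))
    c = sum(X.T @ y / s for X, y, s in zip(Xs, ys, sigmas))
    b = np.linalg.solve(A, c)                                          # Eq. (10-49)
    bs = [np.linalg.solve(X.T @ X, X.T @ y) for X, y in zip(Xs, ys)]   # Eq. (10-51)
    rss_r = sum((y - X @ b) @ (y - X @ b) / s
                for X, y, s in zip(Xs, ys, sigmas))                    # restricted
    rss_u = sum((y - X @ bi) @ (y - X @ bi) / s
                for X, y, bi, s in zip(Xs, ys, bs, sigmas))            # unrestricted
    quad = sum((bi - b) @ X.T @ X @ (bi - b) / s
               for X, bi, s in zip(Xs, bs, sigmas))                    # Eq. (10-52)
    F = ((rss_r - rss_u) / (k * (p - 1))) / (rss_u / (p * (m - k)))
    return F, rss_r - rss_u, quad

# Hypothetical data: p = 4 units, m = 10 observations, k = 2 coefficients.
rng = np.random.default_rng(3)
sigmas = [1.0, 2.0, 0.5, 1.5]
Xs = [np.column_stack([np.ones(10), rng.normal(size=10)]) for _ in sigmas]
ys = [X @ np.array([1.0, 2.0]) + np.sqrt(s) * rng.normal(size=10)
      for X, s in zip(Xs, sigmas)]

F_stat, rss_diff, quad_form = pooling_test(ys, Xs, sigmas)
```

The difference in residual sums of squares and the quadratic form should agree to rounding error, which is the content of Eq. (10-52).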


Adaptive regression models. A different form of modeling variable-parameter 
schemes is the adaptive regression model associated mainly with the names of 
Cooley and Prescott.§ The model is designed for application to time-series data. 
We will illustrate the basic idea first of all with reference to the intercept term. 


† See Problem 10-9.

‡ See P. A. V. B. Swamy, Statistical Inference in Random Coefficient Regression Models,
Springer-Verlag, New York, 1971; P. A. V. B. Swamy, “Criteria, Constraints and Multicollinearity in Random
Coefficient Regression Models,” Annals of Economic and Social Measurement, vol. 2, 1973, pp.
429-450; and P. A. V. B. Swamy, “Linear Models with Random Coefficients,” in P. Zarembka, Ed.,
Frontiers in Econometrics, Academic Press, New York, 1974.

§ T. F. Cooley and E. C. Prescott, “An Adaptive Regression Model,” International Economic
Review, vol. 14, 1973, pp. 364-371; T. F. Cooley and E. C. Prescott, “Tests of an Adaptive Regression
Model,” Review of Economics and Statistics, vol. 55, 1973, pp. 248-256; and T. F. Cooley and E. C.
Prescott, “Systematic (Non-Random) Variation Models: Varying Parameter Regression: A Theory
and Some Applications,” Annals of Economic and Social Measurement, vol. 2, 1973, pp. 463-474.

Consider the relation

$$y_t = \alpha_t + \beta x_t + u_t \tag{10-53}$$

The additive disturbance $u_t$ shifts the function up or down period by period.
Cooley and Prescott make the additional assumption that the intercept term is
subject to change according to

$$\alpha_t = \alpha_{t-1} + v_{t-1} \tag{10-54}$$
Assume the u’s and v’s to be independently distributed with zero means and
variances $\sigma_u^2$ and $\sigma_v^2$ and assume also that $u_t$ and $v_s$ are independent for all t, s.
The model is similar to the conventional regression with fixed parameters and an
autoregressive disturbance process. The difference is that autoregressive shocks
are subject to exponential decay while the effects of the v’s persist. An AR(1)
disturbance process would give

$$y_t = \alpha + \beta x_t + (\varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots + \rho^{t-1}\varepsilon_1 + \rho^tu_0)$$

while the adaptive model gives

$$y_t = \alpha_0 + \beta x_t + (v_{t-1} + v_{t-2} + \cdots + v_0) + u_t$$

where $\alpha_0$ is the intercept in the immediate presample period. One might estimate
the adaptive model in the form just given, in which case the parameters would be
$\alpha_0$, β, $\sigma_u^2$, and $\sigma_v^2$. Cooley and Prescott, with an emphasis on forecasting, express
the model in terms of $\alpha_{n+1}$, the intercept in the first postsample period. From Eq.
(10-54)

$$\alpha_{n+1} = \alpha_t + \sum_{s=t}^{n}v_s$$

Thus Eq. (10-53) may be written

$$y_t = \alpha_{n+1} + \beta x_t + w_t \tag{10-55}$$

where

$$w_t = u_t - \sum_{s=t}^{n}v_s$$

Estimation of Eq. (10-55) is simplified by reparameterizing the disturbance 
variances as 

$$\sigma_u^2 = (1 - \gamma)\sigma^2 \qquad \sigma_v^2 = \gamma\sigma^2 \qquad 0 \le \gamma \le 1 \tag{10-56}$$

The larger γ, the greater is the importance of the “permanent” component v in
the shift of the function relative to that of the transient component u. Using Eqs.
(10-56) and the assumptions previously made about u and v gives the variance
matrix of w as

$$E(ww') = \sigma^2\Omega$$

where

$$\Omega = (1 - \gamma)I_n + \gamma\begin{bmatrix} n & n-1 & n-2 & \cdots & 1 \\ n-1 & n-1 & n-2 & \cdots & 1 \\ n-2 & n-2 & n-2 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{bmatrix}$$

If the u’s and v’s are normally distributed, the log likelihood is

$$\ln L = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2}\ln|\Omega| - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta)$$

where Xβ represents the systematic part, $\alpha_{n+1} + \beta x_t$, of Eq. (10-55). If γ, and
hence Ω, were known, the ML estimates of β and $\sigma^2$ could be computed from

$$\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$$

and

$$\hat{\sigma}^2 = \frac{1}{n}(y - X\hat{\beta})'\Omega^{-1}(y - X\hat{\beta})$$

However, γ is unknown, but it is confined to the interval (0, 1), which suggests a
grid search. Substituting $\hat{\beta}$ and $\hat{\sigma}^2$ for β and $\sigma^2$ in the log likelihood gives the
concentrated function

$$\ln L = \text{constant} - \frac{n}{2}\ln\hat{\sigma}^2 - \frac{1}{2}\ln|\Omega| \tag{10-58}$$

Maximizing Eq. (10-58) over γ yields $\hat{\gamma}$, which then gives $\hat{\Omega}$. The feasible GLS
estimators are then

$$b_* = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y$$

and

$$s^2 = \frac{1}{n}(y - Xb_*)'\hat{\Omega}^{-1}(y - Xb_*)$$

The asymptotic distribution of $b_*$ is normal with mean β and variance matrix
$\sigma^2(X'\Omega^{-1}X)^{-1}$. The asymptotic variance matrix for $(\gamma, \sigma^2)$ is more complicated
and is given in the first of the Cooley-Prescott papers.
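
The grid search over γ can be sketched as follows. This is a minimal illustration for the adaptive-intercept case only; the grid of 21 points, the function names, and the simulated drifting-intercept data are assumptions, and the additive constant in Eq. (10-58) is dropped because it does not affect the maximizer.

```python
import numpy as np

def omega(n, gamma):
    """Omega = (1 - gamma) I + gamma Q, where Q[t, s] = n - max(t, s) + 1 for t, s = 1..n."""
    idx = np.arange(n)                               # 0-based positions
    Q = n - np.maximum.outer(idx, idx)               # equals n - max(t, s) + 1 in 1-based terms
    return (1 - gamma) * np.eye(n) + gamma * Q

def concentrated_loglik(y, X, gamma):
    """Eq. (10-58) without the constant, plus the GLS estimates for this gamma."""
    n = len(y)
    Om = omega(n, gamma)
    Om_inv = np.linalg.inv(Om)
    beta = np.linalg.solve(X.T @ Om_inv @ X, X.T @ Om_inv @ y)
    r = y - X @ beta
    sigma2 = r @ Om_inv @ r / n
    _, logdet = np.linalg.slogdet(Om)
    return -0.5 * n * np.log(sigma2) - 0.5 * logdet, beta, sigma2

def grid_search(y, X, grid=np.linspace(0.0, 1.0, 21)):
    """Return the grid value of gamma with the largest concentrated log likelihood."""
    results = [(concentrated_loglik(y, X, g)[0], g) for g in grid]
    return max(results)[1]

# Hypothetical data with a slowly drifting intercept.
rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=n)
alpha_path = 1.0 + np.cumsum(0.3 * rng.normal(size=n))
y = alpha_path + 2.0 * x + 0.3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])    # first coefficient plays the role of alpha_{n+1}

gamma_hat = grid_search(y, X)
loglik_hat, b_star, s2 = concentrated_loglik(y, X, gamma_hat)
```

A finer grid around the first maximizer would normally follow; the coarse grid here only shows the mechanics.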

The idea of adaptive coefficients can obviously be extended to slopes as well 
as intercepts. This is done in the third of the Cooley-Prescott papers. To illustrate 
the treatment consider the three-variable model 


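$$Y_t = [1 \;\; X_{2t} \;\; X_{3t}]\begin{bmatrix} \beta_{1t} \\ \beta_{2t} \\ \beta_{3t} \end{bmatrix}$$

The assumptions now are

$$\beta_{it} = \beta_{it}^p + u_{it}$$

$$\beta_{it}^p = \beta_{i,t-1}^p + v_{it} \qquad i = 1, 2, 3 \tag{10-59}$$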
where the superscript p denotes the permanent part of a coefficient. The Cooley-Prescott 
assumptions about $u_{it}$ and $v_{it}$ are 
$$u_t \sim N(0, (1-\gamma)\sigma^2\Sigma_u) \qquad v_t \sim N(0, \gamma\sigma^2\Sigma_v) \qquad t = 1, \ldots, n \tag{10-60}$$

where $u_t$ and $v_t$ are 3 × 1 vectors. In addition the $u_t$ and the $v_t$ are serially 
independent, and $u_s$ and $v_t$ are independent for all s, t. The new feature is the 
appearance of the 3 × 3 variance matrices $\Sigma_u$ and $\Sigma_v$. For the estimation method 
to work, these matrices have to be known up to scale factors. Thus they can be 
normalized by setting, say, the element in the top left-hand position to unity. 
Writing $\Sigma_u$ as 

$$\Sigma_u = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma_{22} & \sigma_{23} \\ 0 & \sigma_{23} & \sigma_{33} \end{bmatrix}$$
implies that $u_{1t}$ is independent of $u_{2t}$ and $u_{3t}$ for all t, that the random 
components in $\beta_{2t}$ and $\beta_{3t}$ have variances proportional to $\sigma_{22}$ and $\sigma_{33}$, respectively, 
and that these same random components have a covariance proportional to $\sigma_{23}$. If 
one has no reason to expect a nonzero covariance, $\sigma_{23}$ is set at zero and $\Sigma_u$ 
becomes diagonal with only two elements to specify. If one assumes that the 
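intercept is the only coefficient subject to transitory changes and that the 
permanent changes are independent, the matrices become 

$$\Sigma_u = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad \Sigma_v = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma_{22} & 0 \\ 0 & 0 & \sigma_{33} \end{bmatrix}$$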
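Finally if one assumes the slope coefficients to be constant, the matrices reduce to 
$$\Sigma_u = \Sigma_v = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
and 
$$\beta_{1t} = \beta_{1t}^p + u_{1t} \qquad \beta_{1t}^p = \beta_{1,t-1}^p + v_{1t}$$

which is simply another way of writing the adaptive intercept case already 
studied. The intercept is then $\beta_{1t}^p$ (= $\alpha_t$), the equation disturbance is $u_{1t}$, and 
$v_{1t} = v_t$. 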

For the general case of variability in all coefficients Cooley and Prescott 
suggest that unless there is special a priori knowledge, one assumes the matrices 
$\Sigma_u$ and $\Sigma_v$ to be equal. In one practical application the diagonal elements in this 
common matrix were set equal to the estimated sampling variances of the 
parameters computed under the assumption of parameter constancy. The authors, 
however, report that losses in efficiency are surprisingly small, even for sizable 
errors, in specifying $\Sigma_u$ and $\Sigma_v$. 

The general model may now be sketched briefly: 


$$Y_t = x_t'\beta_t \qquad t = 1, \ldots, n$$

where $x_t$ is the k × 1 vector of explanatory variables at time t, including unity in 
the first position to take care of the intercept. The variable-parameter assumptions 
are 
$$\beta_t = \beta_t^p + u_t \qquad \beta_t^p = \beta_{t-1}^p + v_t \qquad t = 1, \ldots, n$$
It then follows that 
$$\beta_t^p = \beta_{n+1}^p - \sum_{s=t+1}^{n+1} v_s$$
and so 

$$Y_t = x_t'\beta_{n+1}^p + w_t \tag{10-61}$$

where 
$$w_t = x_t'u_t - x_t'\sum_{s=t+1}^{n+1} v_s$$
and emphasis is placed on estimating the permanent coefficients for the first 
postsample period. The variance matrix for the disturbance term in Eq. (10-61) is 

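$$E(ww') = \sigma^2[(1-\gamma)R + \gamma Q] = \sigma^2\Omega \tag{10-62}$$
where R is a diagonal matrix with 

$$r_{ii} = x_i'\Sigma_u x_i$$

and Q is defined by 

$$q_{ij} = \min(n - i + 1,\; n - j + 1)\,x_i'\Sigma_v x_j$$
Given $\Sigma_u$ and $\Sigma_v$, $\Omega$ depends only on $\gamma$. Thus a grid search over the (0, 1) interval 
will yield a $\hat\gamma$ which in turn gives $\hat\Omega$ and the estimators 

$$b_{n+1}^p = (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y$$
and 
$$s^2 = \frac{(y - Xb_{n+1}^p)'\hat\Omega^{-1}(y - Xb_{n+1}^p)}{n}$$

The grid search for $\gamma$ is in terms of the concentrated likelihood function Eq. 
(10-58), with $\Omega$ now defined in Eq. (10-62). 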


10-5 QUALITATIVE DEPENDENT VARIABLES 


We saw in Sec. 6-3 on dummy variables that there was no essential difficulty in 
the incorporation of qualitative variables in the X matrix. It is, however, quite a 
different matter when the dependent variable is qualitative or categorical in 
nature. We may distinguish three main cases. 


1. Dichotomous, binary, or quantal responses. These can be characterized by a 
variable Y which takes on the value one or zero according to which of two 
possible results occurs. For example, $Y_i$ = 1 or 0 if individual i dies (lives); if 
person i goes to college (does not go to college); if family i goes abroad on 
vacation (does not go abroad); and so on. 

2. Polytomous responses. This case refers to more than two possible responses. 
Thus a family may have 
no vacation 
a vacation in the United States 
a vacation in Europe 
a vacation elsewhere 

3. Limited dependent variable. This includes both cases 1 and 2 as special cases 
but is also more general. One may have a quantitative dependent variable 
which is subject to some limit, whether upper or lower, or both. This is also 
referred to as the case of censored, or truncated, variables. 


Space forbids a treatment of all three cases. We will concentrate on the basic 
ideas underlying the binary case, which are also the foundation for any treatment 
of the more complicated cases. 


A Single Dichotomous Variable 


Suppose the members of a union who have been on strike in a wage dispute are 
now being asked to vote on a specific wage increase $w_0$. Let us assume that each 
worker has a reservation wage increase and there is some distribution f(w) of this 
figure over the population of workers. The response of an individual worker is 
denoted by the dichotomous variable 

$$y = \begin{cases} 1 & \text{if the worker accepts the offer} \\ 0 & \text{if the worker rejects the offer} \end{cases}$$

A worker accepts the offer if it exceeds his or her reservation figure. Thus the 
proportion of the population accepting the specific offer $w_0$ is given by 

$$\pi_0 = \int_0^{w_0} f(w)\,dw \tag{10-63}$$


Management would clearly like to know as much as possible about the distribution 
f(w). What value of $w_0$, for example, would be required in Eq. (10-63) to 
yield a probability in excess of, say, 0.5? If the distribution f(w) remained 
constant over a sequence of contracts and various wage increases were subjected 
to ballots, estimation of the parameters of f(w) would be a possibility. Alternatively, 
at a given period in time, one might imagine a government mediator 
sampling various groups of workers with a variety of hypothetical wage increases 
in an attempt to chart the f(w) distribution. 

The main use of this type of analysis has not been in economics but in 
bioassay.† Applications in economics are, however, increasing with the ever 
expanding supply of micropanel data. In bioassay a specific dosage $z_0$ of, say, a 
poison is administered to each member of a population (insect, animal, human). 
The responses of the individual members are presumed independent of each other. 
For a great variety of reasons the tolerance to the poison varies from individual to 
individual and may be described by some distribution f(z). If the tolerance is less 
than the dosage, the individual succumbs to the poison. Thus the proportion of 
the population dying at dosage $z_0$ is 

$$\pi_0 = \int_0^{z_0} f(z)\,dz$$


† Two basic references are D. J. Finney, Probit Analysis, 3d edition, Cambridge University Press, 
New York, 1971; and D. R. Cox, The Analysis of Binary Data, Methuen, London, 1970. 

Finney suggests that the distribution of tolerances is often skew and approximately 
log normal. Thus a transformation of tolerances (and dosage) by 
$$x = \ln z$$

will render f(x) approximately normal. The dosage-response curve would then be 
represented by the cumulative normal distribution as shown in Fig. 10-3, where 
$x \sim N(\mu, \sigma^2)$. At dosage $x_1$ the proportion dying is read off from the curve as $\pi_1$, 
at $x_2$ the proportion is $\pi_2$, and so forth. The practical problem is now the 
estimation of $\mu$ and $\sigma^2$. Suppose, to this end, an experimenter selects a set of 
dosages $x_1, x_2, \ldots, x_k$. The ith dose is administered to $n_i$ individuals and the 
proportion $p_i$ dying is measured. On the assumption of a normal distribution for 
the tolerances the sample proportions will be scattered around the cumulative 
curve in Fig. 10-3. The use of $p_i$ to estimate $\mu$ and $\sigma^2$ is difficult since p is a 
nonlinear function of x. The probit transformation linearizes the relationship and 
makes the estimation of $\mu$ and $\sigma^2$ relatively straightforward. First define 

$$y = \frac{x - \mu}{\sigma}$$

Thus 
$$y \sim N(0, 1)$$


and any dosage $x_0$ can also be expressed in terms of y. The probability of death 
with dosage $x_0$ is now given by 
$$\pi_0 = F(y_0) \tag{10-64}$$

where F(·) is the cumulative standard normal distribution and $y_0 = (x_0 - \mu)/\sigma$. 
This is shown in Fig. 10-4, which is simply a repeat of Fig. 10-3 with the 
horizontal axis translated to y. Inverting Eq. (10-64) gives 

$$F^{-1}(\pi_0) = y_0 = \frac{x_0 - \mu}{\sigma} \tag{10-65}$$


Figure 10-3 

Figure 10-4 


Given a value of y, one can read off the corresponding $\pi$. Conversely, given $\pi$, one 
can read off the corresponding value of y. The y variable is defined as the normal 
equivalent deviate (n.e.d.) or by the somewhat unattractive term “normit.” A 
probit is defined as 
$$\text{Probit} = y + 5$$

From Eq. (10-65) there is an exact linear relationship between the n.e.d. and 
dosage or, equivalently, between probit and dosage. The n.e.d. will be negative 
whenever $\pi$ < 0.5, whereas the probit will almost never be negative.† Fisher and 
Yates give a table transforming percentages to probits.‡ 
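
Where a printed table is not at hand, the two transformations can be computed directly. The following Python sketch uses the standard library's `statistics.NormalDist`; the function names `ned`, `probit`, and `proportion_from_ned` are labels chosen here for illustration and do not come from the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def ned(pi):
    """Normal equivalent deviate: the y for which F(y) = pi, with 0 < pi < 1."""
    return STD_NORMAL.inv_cdf(pi)

def probit(pi):
    """Probit as defined in the text: the n.e.d. plus 5."""
    return ned(pi) + 5.0

def proportion_from_ned(y):
    """Inverse operation: the proportion pi = F(y) corresponding to a given n.e.d."""
    return STD_NORMAL.cdf(y)
```

For a proportion of 0.5 the n.e.d. is zero and the probit is 5; proportions below 0.5 give negative n.e.d.'s but probits that remain positive.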

In a typical experiment dosages $x_1, x_2, \ldots, x_k$ are administered to $n_1, n_2, \ldots, n_k$ 
subjects, respectively. The resultant proportions $p_1, p_2, \ldots, p_k$ are measured. 
The estimation procedure then follows directly from Eq. (10-65). 

1. Convert the sample proportions $p_1, p_2, \ldots, p_k$ into n.e.d.’s and plot against 
dosage x. 
2. If the scatter in step 1 is approximately linear, then fit the regression 
$$\text{n.e.d.} = a + bx \tag{10-66}$$
where§ 
$$a = \text{estimate of } -\frac{\mu}{\sigma} \qquad b = \text{estimate of } \frac{1}{\sigma}$$


† The number 5 was chosen to eliminate negative probits, since negative standard normal deviates 
with absolute values approaching 5 will almost never be found. As Finney explains, “At a time when 
most biologists lacked even simple calculating machines, and many had little skill in statistical 
arithmetic, avoidance of negative quantities was an appreciable practical advantage.” D. J. Finney, 
op. cit., p. 23. 

‡ R. A. Fisher and F. Yates, Statistical Tables for Biological, Agricultural and Medical Research, 6th 
edition, Oliver and Boyd, Edinburgh, 1963, Table IX. 

§ If probits are used instead of n.e.d.’s, a is an estimate of $5 - \mu/\sigma$. 

A simple OLS regression would be unbiased but inefficient since it ignores the 
properties of the error structure. A GLS estimator may be obtained by taking 
account of the likely nature of the errors. Write the sample proportions as 

$$p_i = \pi_i + \varepsilon_i \qquad i = 1, \ldots, k$$
Thus† 

$$p_i \sim \text{binomial}\left(\pi_i, \frac{\pi_i(1 - \pi_i)}{n_i}\right)$$

The exact relationship is 

$$F^{-1}(\pi_i) = -\frac{\mu}{\sigma} + \frac{1}{\sigma}x_i \qquad i = 1, \ldots, k$$

The computed n.e.d.’s are given by 
$$F^{-1}(p_i) = F^{-1}(\pi_i + \varepsilon_i)$$


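Applying a first-order Taylor expansion 

$$F^{-1}(p_i) = F^{-1}(\pi_i) + \varepsilon_i \left.\frac{dF^{-1}}{dp_i}\right|_{p_i = \pi_i}$$
Thus 
$$F^{-1}(p_i) = -\frac{\mu}{\sigma} + \frac{1}{\sigma}x_i + u_i \tag{10-67}$$
where 
$$u_i = \varepsilon_i \left.\frac{dF^{-1}}{dp_i}\right|_{p_i = \pi_i}$$

Returning to 
$$p_i = F(y_i) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y_i} e^{-y^2/2}\,dy$$

$$\frac{dp_i}{dy_i} = \frac{1}{\sqrt{2\pi}}e^{-y_i^2/2} = Z_i = \text{ordinate of the standard normal curve at } y_i$$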
Ly} 7 


Thus‡ 

$$\left.\frac{dF^{-1}}{dp_i}\right|_{p_i = \pi_i} = \frac{1}{Z_i}$$
and 
$$\operatorname{var}(u_i) = \frac{\pi_i(1 - \pi_i)}{n_i Z_i^2} \tag{10-68}$$

The regression equation (10-67) thus has a heteroscedastic disturbance given by 
Eq. (10-68). Feasible GLS estimators would be achieved by computing a weighted 
regression of the empirical n.e.d.’s on dosage x using $n_i Z_i^2 / p_i(1 - p_i)$ as weights. 

† The binomial distribution applies since each individual in the ith group is subjected to dosage $x_i$ 
and hence to a probability $\pi_i$ of death or whatever. Moreover, individual responses are assumed to be 
independent of one another. 

‡ p = F(y) is a monotonic function, and so is its inverse. We have 
$$dp = \frac{dF}{dy}\,dy \quad\text{and}\quad dy = \frac{dF^{-1}}{dp}\,dp$$

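
The weighted regression of Eqs. (10-66) to (10-68) can be sketched as follows. This is a minimal illustration assuming a single dosage variable and sample proportions strictly between 0 and 1; the function name `probit_wls` and its return layout are assumptions made here, and the normal ordinates are evaluated at the empirical n.e.d.'s, as the feasible weights require.

```python
from statistics import NormalDist

def probit_wls(doses, n, p):
    """Weighted regression of empirical n.e.d.'s on dosage.

    doses, n, p: dosage levels, group sizes, and observed proportions (0 < p < 1).
    Returns (a, b, mu_hat, sigma_hat), where a estimates -mu/sigma and b estimates 1/sigma.
    """
    nd = NormalDist()
    y = [nd.inv_cdf(pi) for pi in p]            # empirical n.e.d.'s
    z = [nd.pdf(yi) for yi in y]                # normal ordinates Z_i
    w = [ni * zi ** 2 / (pi * (1.0 - pi))       # weights: reciprocal of var(u_i)
         for ni, zi, pi in zip(n, z, p)]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, doses)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, doses, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, doses))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b, -a / b, 1.0 / b
```

The last two returned values simply invert the definitions of a and b to recover the implied mean and standard deviation of the tolerance distribution.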
The next extension to consider is where the stimulus or dosage is not a single 
variable but some linear combination of variables. Thus the ith level of the 
stimulus might be denoted by† 

$$s_i = x_i'\beta$$


where $x_i$ is a column vector of k variables and $\beta$ is a k × 1 vector of coefficients 
presumed constant over all individuals. For example, in the question of whether 
or not to purchase a new car in a given year the x vector would include such 
variables as income, the relative prices of cars and gasoline, the age of the present 
car, and so forth. We still assume that each individual has a threshold level for car 
purchase, and we postulate a distribution f(s) over the population, where s 
indicates the threshold or minimum stimulus required to trigger a new car 
purchase. Thus the probability of a car purchase at stimulus level $s_i$ is 

$$\pi_i = \int_{-\infty}^{s_i} f(s)\,ds$$
If the f(s) distribution were normal with mean $\mu$ and variance $\sigma^2$, then 
$$\pi_i = F\left(\frac{s_i - \mu}{\sigma}\right)$$

where F(·) again indicates the cumulative standard normal distribution. The 
observed sample proportions $p_i$ are transformed into n.e.d.’s, and the appropriate 
regression is 

$$y_i = F^{-1}(p_i) = F^{-1}(\pi_i) + u_i$$
or 
$$y_i = \frac{s_i - \mu}{\sigma} + u_i = -\frac{\mu}{\sigma} + \frac{1}{\sigma}x_i'\beta + u_i$$
The relationship actually estimated is then 
$$y_i = x_i'\beta^* + u_i \tag{10-69}$$
where 
$$\beta^* = \frac{1}{\sigma}\begin{bmatrix} (\beta_1 - \mu) & \beta_2 & \cdots & \beta_k \end{bmatrix}'$$

Since $p_i$ is still a binomial variable with mean $\pi_i$ and variance $\pi_i(1 - \pi_i)/n_i$, the 
disturbance term $u_i$ will have the same properties as above. Thus GLS may be 
applied to Eq. (10-69) with the correction for heteroscedasticity implied by Eq. 
(10-68). 


† We are now using s (rather than x) to indicate stimulus or dosage, since we wish to use $x_i$ to 
indicate a vector of explanatory variables, in conformity with the notation in regression analysis. 

An obvious problem with the application of this model in economics is the 
difficulty of ensuring that $n_i$ individuals are subjected to a given stimulus $x_i'\beta$. The 
$\beta$ vector, of course, is unknown, but an appropriate method available with large 
data sets is to classify units into subsets with given values of explanatory variables 
such as income, age of car, and so on. A second problem is that we may have little 
justification for the normality assumption underlying the n.e.d. or probit approach. 
This may be explored by making different assumptions about the relationship 
between the probabilities $\pi_i$ and the stimulus level $s_i = x_i'\beta$. 

The simplest alternative assumption is that of a linear relationship, namely, 

$$\pi_i = x_i'\beta$$
or 
$$p_i = x_i'\beta + u_i$$


If this is estimated by OLS, or by GLS taking account of the heteroscedasticity in 
u, it may give a reasonable fit to “middle-range” data, but it is doomed to run 
into difficulty for extreme values of $x'\beta$ since there is nothing in either procedure 
to prevent estimated probabilities turning out to be negative or in excess of unity. 

The more common and more sensible procedure is to model the probabilities 
$\pi_i$ by some distribution function other than the cumulative normal. Perhaps the 
most frequently used is the logistic.† This may be formulated as 
$$\pi = \frac{e^{x'\beta}}{1 + e^{x'\beta}} = \frac{1}{1 + e^{-x'\beta}} \tag{10-70}$$

Clearly, $\pi$ is constrained to the (0, 1) interval. It increases monotonically with the 
stimulus $x'\beta$, it equals 0.5 when $x'\beta$ = 0, and it has a shape similar to that of the 
cumulative normal.‡ It is, however, simpler to work with than the cumulative 
normal. 

† The classic reference is D. McFadden, “Conditional Logit Analysis of Qualitative Choice 
Behavior,” in P. Zarembka, Ed., Frontiers in Econometrics, Academic Press, New York, 1974, Chap. 4. 

‡ See D. R. Cox, The Analysis of Binary Data, Methuen, London, 1970, p. 28, Table 2.1. 

It follows directly from Eq. (10-70) that 

$$\ln\left(\frac{\pi}{1 - \pi}\right) = x'\beta \tag{10-71}$$

that is, the logarithm of the odds ratio or logit is an exact linear function of the 
x’s. As before, the observed sample proportions $p_i = \pi_i + \varepsilon_i$ follow the binomial 
distribution 

$$p_i \sim \text{binomial}\left(\pi_i, \frac{\pi_i(1 - \pi_i)}{n_i}\right)$$

We seek a relationship between the observed logits and the true logits. Letting 

$$f(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right)$$

a first-order Taylor expansion around $\pi_i$ gives 

$$f(p_i) = f(\pi_i) + \varepsilon_i\left.\frac{\partial f}{\partial p_i}\right|_{\pi_i} \qquad \left.\frac{\partial f}{\partial p_i}\right|_{\pi_i} = \frac{1}{\pi_i(1 - \pi_i)}$$
Thus 

$$\ln\left(\frac{p_i}{1 - p_i}\right) = x_i'\beta + u_i \tag{10-72}$$

where 
$$u_i = \frac{\varepsilon_i}{\pi_i(1 - \pi_i)}$$
so that 

$$E(u_i) = 0 \quad\text{and}\quad \operatorname{var}(u_i) = \frac{1}{n_i\pi_i(1 - \pi_i)} \tag{10-73}$$
The appropriate estimation procedure is then as follows. 


1. Compute the observed logits $\ln[p_i/(1 - p_i)]$ from the sample proportions. 
2. Carry out a GLS regression of Eq. (10-72) using the disturbance variances 
obtained from Eq. (10-73) by replacing the unknown $\pi_i$ by $p_i$. 
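
The two steps above can be sketched in Python as follows. The sketch assumes an intercept and a single explanatory variable, with all sample proportions strictly between 0 and 1; the function name `logit_wls` is a label chosen here. Since var(u_i) is estimated by 1/[n_i p_i(1 - p_i)], the GLS weights are n_i p_i(1 - p_i).

```python
import math

def logit_wls(x, n, p):
    """GLS fit of ln[p/(1 - p)] = a + b x + u using weights n_i p_i (1 - p_i)."""
    y = [math.log(pi / (1.0 - pi)) for pi in p]             # step 1: observed logits
    w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]      # step 2: reciprocal variances
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b
```

With several explanatory variables the same weights would be applied in a general weighted least-squares routine.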


So far in both the probit and the logit approaches we have assumed that there 
were several observations at each level of the stimulus so that sample proportions 
could be computed. In some cases this may be infeasible and we just have a single 
observation, y = 1 or y = 0, at each $x'\beta$. The scatter would then look like Fig. 
10-5. 

Figure 10-5 

Fitting a linear regression of y on $x'\beta$ is unlikely to approximate the true 
probabilities over the middle range and gives nonsense results at the extremities. 
The logistic assumption, however, allows the derivation of a fairly simple ML 
estimator, which does not violate the constraints on the probability number. 

For each of n individuals in the sample we now observe a k × 1 vector $x_i$ of 
stimulus variables and a response variable $y_i$ (i = 1, ..., n). The scalar stimulus 
experienced by an individual is given by $s_i = x_i'\beta$, and $y_i$ is a dummy variable, 
taking the value unity when a response is observed and the value zero when there 
is no response. The probability of a response is assumed to be logistic, that is, 

$$\pi_i = \Pr(y_i = 1) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}$$
and 
$$1 - \pi_i = \Pr(y_i = 0) = \frac{1}{1 + e^{x_i'\beta}} \tag{10-74}$$

Suppose that r responses and n - r nonresponses occur in a sample. Let us 
reorder the sample observations so that the responses come first and the 
nonresponses last. The log likelihood is then 

$$\ln L = \sum_{i=1}^{r}\ln \pi_i + \sum_{i=r+1}^{n}\ln(1 - \pi_i)$$
From Eqs. (10-74) 
$$\frac{\partial \ln \pi_i}{\partial \beta} = \left(1 - \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}\right)x_i = (1 - \pi_i)x_i$$
and 
$$\frac{\partial \ln(1 - \pi_i)}{\partial \beta} = -\frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}x_i = -\pi_i x_i$$
Thus 

$$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{r}(1 - \pi_i)x_i - \sum_{i=r+1}^{n}\pi_i x_i = \sum_{i=1}^{r}x_i - \sum_{i=1}^{n}\pi_i x_i$$
The ML estimates of $\beta$ must then satisfy the equation 
$$\sum_{i=1}^{r}x_i = \sum_{i=1}^{n}\hat\pi_i x_i \tag{10-75}$$
The left-hand side is the sum of the x vectors just for the individuals displaying a 
response. The right-hand side is nonlinear in $\beta$, and an iterative nonlinear 
program is required for the estimation of $\beta$. The asymptotic standard errors may 
be obtained as follows. From Eqs. (10-74) 

$$\frac{\partial \pi_i}{\partial \beta} = \frac{e^{x_i'\beta}}{(1 + e^{x_i'\beta})^2}x_i$$

Thus 
$$\frac{\partial \pi_i}{\partial \beta} = \pi_i(1 - \pi_i)x_i$$
and 

$$\frac{\partial^2 \ln L}{\partial \beta\,\partial \beta'} = -\sum_{i=1}^{n}x_i\frac{\partial \pi_i}{\partial \beta'} = -\sum_{i=1}^{n}\pi_i(1 - \pi_i)x_i x_i'$$

If the x’s are treated as nonstochastic, the information matrix is 

$$R(\beta) = \sum_{i=1}^{n}\pi_i(1 - \pi_i)x_i x_i'$$

and the asymptotic variance matrix is $R^{-1}(\beta)$.† 
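
One possible iterative scheme is Newton-Raphson, which updates the estimate by adding the inverse information matrix times the score. The sketch below is an illustration under simplifying assumptions (an intercept plus one explanatory variable, a zero starting value, and no safeguards for perfectly separated data); the names `logit_ml` and `_score_and_information` are introduced here and are not from the text.

```python
import math

def _score_and_information(a, b, x, y):
    """Score and information for pi_i = 1 / (1 + exp(-(a + b x_i)))."""
    g0 = g1 = r00 = r01 = r11 = 0.0
    for xi, yi in zip(x, y):
        pi = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        g0 += yi - pi                # score: sum of (y_i - pi_i) x_i, x_i = (1, x_i)'
        g1 += (yi - pi) * xi
        wgt = pi * (1.0 - pi)        # information: sum of pi_i (1 - pi_i) x_i x_i'
        r00 += wgt
        r01 += wgt * xi
        r11 += wgt * xi * xi
    return (g0, g1), (r00, r01, r11)

def logit_ml(x, y, tol=1e-12, max_iter=50):
    """ML logit estimates (a, b) and the asymptotic variance matrix R^{-1}."""
    a = b = 0.0
    for _ in range(max_iter):
        (g0, g1), (r00, r01, r11) = _score_and_information(a, b, x, y)
        det = r00 * r11 - r01 * r01
        da = (r11 * g0 - r01 * g1) / det      # Newton step: R^{-1} times the score
        db = (r00 * g1 - r01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    _, (r00, r01, r11) = _score_and_information(a, b, x, y)
    det = r00 * r11 - r01 * r01
    cov = [[r11 / det, -r01 / det], [-r01 / det, r00 / det]]
    return a, b, cov
```

At convergence the score is zero, which is exactly the condition in Eq. (10-75); the square roots of the diagonal elements of the returned matrix are the asymptotic standard errors.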


10-6 ERRORS IN VARIABLES 


So far we have implicitly assumed that the X variables have been measured 
without error and that the only form of error in the equation has been in the 
disturbance term u. The latter has generally been thought of as representing the 
influence of various explanatory variables that have not actually been included in 
the relation. It could, of course, also have a component representing measurement 
error in the dependent variable Y, and the previous results would still be valid. 
We now pose the question of what happens if the X variables are subject to 
measurement error. We assume that the $\beta$ vector represents the coefficients of the 
correctly measured X variables. Thus the model is assumed to be 

$$y = \tilde{X}\beta + u \tag{10-76}$$
where $\tilde{X}$ is the n × k matrix of the true (but unobserved) values of the explanatory 
variables. The matrix of observed values is 

$$X = \tilde{X} + V \tag{10-77}$$

where V is the n × k matrix of measurement errors. If some variables are 
measured without error, the appropriate columns of V are zero vectors. Combining 
Eqs. (10-76) and (10-77) gives the following relation between the observed 
variables: 

$$y = X\beta + (u - V\beta) \tag{10-78}$$
The OLS estimator of $\beta$ in Eq. (10-78) is then 

$$b = \beta + (X'X)^{-1}X'(u - V\beta)$$


Conventional assumptions about the error terms are as follows. 


1. The measurement errors in X are uncorrelated in the limit with the true values 
$\tilde{X}$. Thus 

$$\operatorname{plim}\left(\frac{1}{n}\tilde{X}'V\right) = 0$$
and so 
$$\operatorname{plim}\left(\frac{1}{n}X'X\right) = \operatorname{plim}\left(\frac{1}{n}\tilde{X}'\tilde{X}\right) + \operatorname{plim}\left(\frac{1}{n}V'V\right) = \Sigma + \Omega$$

2. The equation disturbance (plus any measurement error in Y) is uncorrelated 
in the limit both with $\tilde{X}$ and V, that is, 

$$\operatorname{plim}\left(\frac{1}{n}V'u\right) = 0 \quad\text{and}\quad \operatorname{plim}\left(\frac{1}{n}\tilde{X}'u\right) = 0$$

With these assumptions 
$$\operatorname{plim} b = \beta - (\Sigma + \Omega)^{-1}\Omega\beta \tag{10-79}$$

and so OLS estimates are inconsistent. The inconsistency is due to the correlation 
between the data matrix X and the composite disturbance term $(u - V\beta)$ in Eq. 
(10-78). 

† For extensions of the material in this section the reader should refer to the works of Cox and 
Finney already cited; also to M. Nerlove and S. J. Press, Univariate and Multivariate Log-Linear and 
Logistic Models, Rand Corporation, R-1306-EDA/NIH, 1973; G. G. Judge et al., The Theory and 
Practice of Econometrics, Wiley, New York, 1980, Chap. 14; and T. Amemiya, “Qualitative Response 
Models: A Survey,” Journal of Economic Literature, vol. 19, 1981, pp. 1483-1536. 

As an illustration of the result consider the two-variable model 


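$$Y_t = \alpha + \beta\tilde{X}_t + u_t$$

where 
$$X_t = \tilde{X}_t + v_t$$
Then 

$$\Sigma = \operatorname{plim}\left(\frac{1}{n}\tilde{X}'\tilde{X}\right) = \operatorname{plim}\begin{bmatrix} 1 & \frac{1}{n}\sum \tilde{X}_t \\ \frac{1}{n}\sum \tilde{X}_t & \frac{1}{n}\sum \tilde{X}_t^2 \end{bmatrix} = \begin{bmatrix} 1 & \mu \\ \mu & \mu^2 + \sigma^2 \end{bmatrix}$$

where $\mu$ and $\sigma^2$ denote, respectively, the mean and the variance of $\tilde{X}$. Further 

$$\Omega = \operatorname{plim}\left(\frac{1}{n}V'V\right) = \operatorname{plim}\begin{bmatrix} 0 & 0 \\ 0 & \frac{1}{n}\sum v_t^2 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & \sigma_v^2 \end{bmatrix}$$

since there is no error in the dummy variable for the intercept term. Substitution 
in Eq. (10-79) gives 

$$\operatorname{plim}\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \alpha \\ \beta \end{bmatrix} - \frac{1}{\sigma^2 + \sigma_v^2}\begin{bmatrix} -\mu\sigma_v^2\beta \\ \sigma_v^2\beta \end{bmatrix}$$
from which 
$$\operatorname{plim}(b) = \beta - \frac{\sigma_v^2}{\sigma^2 + \sigma_v^2}\beta = \frac{\beta}{1 + \sigma_v^2/\sigma^2}$$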
Errors of measurement in X thus bias the estimate of $\beta$ toward zero (downward 
when $\beta$ is positive). The proportionate bias is $\sigma_v^2/(\sigma^2 + \sigma_v^2)$, which for a 
relatively small error variance is approximately the error variance as a percentage of the 
variance of the true X values. The estimate of the intercept is also inconsistent, and 
this result extends to the multivariate case: even if some explanatory variables are 
measured correctly, all coefficients will in general be inconsistent. 
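
A small simulation makes the size of the inconsistency concrete. The sketch below is illustrative only: the parameter values, the seed, and the function names `ols_slope` and `simulate` are assumptions chosen here. With a true-value variance of 4 and a measurement-error variance of 1, the probability limit of the OLS slope is $\beta/(1 + 1/4)$, that is, 1.6 when $\beta = 2$.

```python
import random

def ols_slope(x, y):
    """Slope of the OLS regression of y on x (with intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def simulate(n=20000, alpha=1.0, beta=2.0, mu=5.0,
             sd_x=2.0, sd_v=1.0, sd_u=1.0, seed=1):
    """Return the OLS slopes using the true X values and the error-ridden X values."""
    rng = random.Random(seed)
    x_true = [rng.gauss(mu, sd_x) for _ in range(n)]
    y = [alpha + beta * xt + rng.gauss(0.0, sd_u) for xt in x_true]
    x_obs = [xt + rng.gauss(0.0, sd_v) for xt in x_true]   # observed X = true X + error
    return ols_slope(x_true, y), ols_slope(x_obs, y)
```

In a large sample the first slope returned by `simulate()` should be close to 2 and the second close to the attenuated value 1.6.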

The measurement error in the X variables thus poses a possibly serious 
estimation problem, and alternative estimators are required. There are two main 
types of estimator described in the literature. One is based on instrumental 
variables of various kinds and the other on ML methods, buttressed with fairly 
strong assumptions about the covariance matrix of the measurement errors. 
Before describing the estimators it is worth emphasizing the possibility that in 
certain circumstances economic agents may react to the measured values rather 
than the true values of economic variables. Firms may base investment decisions 
on some extrapolation of national income trends and in so doing will use the 
latest national income statistics complete with such errors as they contain. If 
decision makers respond to measured data, then the measurement error is 
irrelevant and our previous techniques will be valid. 


Instrumental Variable Estimators 


The IV method requires a matrix Z of variables which are correlated with the true 
$\tilde{X}$ but uncorrelated in the limit with the measurement errors V. The IV estimator 
is 

$$b_{IV} = (Z'X)^{-1}Z'y \tag{10-80}$$
which will then be consistent and have asymptotic variance matrix 
$$\text{asy var}(b_{IV}) = \sigma^2(Z'X)^{-1}Z'Z(X'Z)^{-1}$$
To illustrate some of the instrumental variables that have been suggested, 
consider first the two-variable model, which may be written 
$$Y_t = \alpha + \beta X_t + (u_t - \beta v_t)$$
Suppose there is an even number of sample observations. Define Z as 
$$Z' = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ -1 & 1 & 1 & \cdots & -1 \end{bmatrix}$$

where the elements in the second row are plus or minus 1 according to whether 
the corresponding value of X is above or below the median X value. Application 
of Eq. (10-80) then gives 

$$\begin{bmatrix} a_{IV} \\ b_{IV} \end{bmatrix} = \begin{bmatrix} n & n\bar{X} \\ 0 & \frac{n}{2}(\bar{X}_2 - \bar{X}_1) \end{bmatrix}^{-1}\begin{bmatrix} n\bar{Y} \\ \frac{n}{2}(\bar{Y}_2 - \bar{Y}_1) \end{bmatrix}$$
where $\bar{X}_2$ and $\bar{X}_1$ denote the means of the X values above and below the median and 
$\bar{Y}_2$ and $\bar{Y}_1$ the means of the corresponding Y values. The estimator of the slope is 

$$b_{IV} = \frac{\bar{Y}_2 - \bar{Y}_1}{\bar{X}_2 - \bar{X}_1}$$

and the intercept is estimated by 
$$a_{IV} = \bar{Y} - b_{IV}\bar{X}$$

This procedure amounts to partitioning the data into two subsets by the median 
X value and passing a straight line through the mean points $(\bar{X}_1, \bar{Y}_1)$ and $(\bar{X}_2, \bar{Y}_2)$. 
If n is odd, one should omit the central observation before beginning the 
computations. This estimator was first proposed by Wald.† Under fairly general 
conditions the Wald estimator is consistent but likely to have a large sampling 
variance. Bartlett has shown that the efficiency may be increased by dividing the 
X values into three groups of approximately equal size, the first containing 
the n/3 smallest X values and the third the n/3 greatest X values.‡ Omitting the 
central n/3 observations, the slope is estimated by 

$$b_{IV} = \frac{\bar{Y}_3 - \bar{Y}_1}{\bar{X}_3 - \bar{X}_1}$$

and the intercept as usual by $a_{IV} = \bar{Y} - b_{IV}\bar{X}$. 
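
Both grouping estimators can be computed by one short routine. The following Python sketch is an illustration rather than a definitive implementation: the function name `group_estimator` and its `groups` argument are assumptions made here, and ties in X are broken arbitrarily by the sort. With `groups=2` it gives the Wald estimator (dropping the central observation when n is odd); with `groups=3` it gives the Bartlett estimator, using the lowest and highest thirds for the slope and the overall means for the intercept.

```python
from statistics import mean

def group_estimator(x, y, groups=2):
    """Wald (groups=2) or Bartlett (groups=3) grouping estimator of a and b."""
    pairs = sorted(zip(x, y))                 # order the observations by X
    n = len(pairs)
    if groups == 2 and n % 2 == 1:
        pairs = pairs[:n // 2] + pairs[n // 2 + 1:]   # omit the central observation
        n -= 1
    m = n // groups                           # size of the lowest and highest groups
    low, high = pairs[:m], pairs[-m:]
    b = ((mean(q for _, q in high) - mean(q for _, q in low))
         / (mean(p for p, _ in high) - mean(p for p, _ in low)))
    a = mean(q for _, q in pairs) - b * mean(p for p, _ in pairs)
    return a, b
```

When the data lie exactly on a line both variants reproduce its intercept and slope; with noisy data they will generally differ from each other and from OLS.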

Extension of the grouping methods of Wald and Bartlett to more than one 
explanatory variable is cumbersome and tedious. A somewhat different IV 
estimator suggested by Durbin does not have this drawback.§ The suggestion is to 
rank the X values in ascending order and then define the Z matrix as 

$$Z' = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & 2 & 3 & \cdots & n \end{bmatrix}$$
where the second row indicates the rank values of the X’s.¶ Substitution in Eq. 
(10-80) then gives the estimate of the slope as 

$$b_{IV} = \frac{\sum i\,y_i}{\sum i\,x_i} \tag{10-81}$$

where $y_i = Y_i - \bar{Y}$ and $x_i = X_i - \bar{X}$. The estimate of the intercept turns out to be 

$$a_{IV} = \frac{\bar{Y}\sum i\,X_i - \bar{X}\sum i\,Y_i}{\sum i\,x_i} \tag{10-82}$$

This procedure can easily be extended to replace additional explanatory variables 
by their ranks. Asymptotic standard errors may be estimated by the usual IV 
formula. It is likely that the instrumental variables for these grouping schemes 
will not be highly correlated with the X variables. Thus the IV estimators will 
probably have fairly large standard errors compared with those of OLS, which is 
the price that has to be paid for consistency. 

† A. Wald, “The Fitting of Straight Lines if Both Variables Are Subject to Error,” Annals of 
Mathematical Statistics, vol. 11, 1940, pp. 284-300. 

‡ M. S. Bartlett, “Fitting a Straight Line when Both Variables Are Subject to Error,” Biometrics, 
vol. 5, 1949, pp. 207-212. It is easily seen that this is equivalent to making the second row in Z′ 
consist of equal numbers of zeros and plus and minus ones according to the ranks of the X values. 

§ J. M. Durbin, “Errors in Variables,” Review of the International Statistical Institute, vol. 22, 1954, 
pp. 23-32. 

¶ With this formulation plim((1/n)Z′Z) would not exist as required for the consistency of the IV 
estimator. However, if the second row is replaced by 1/n, 2/n, ..., 1, the condition will be satisfied 
and the same estimates as in Eqs. (10-81) and (10-82) will result. 
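
Durbin's rank-based estimator in Eqs. (10-81) and (10-82) needs only the ranks of the X values. The Python sketch below is a minimal illustration for the two-variable case; the function name `durbin_rank_iv` is a label chosen here, tied X values are ranked in their order of appearance (an assumption, since the text does not discuss ties), and the intercept is computed in the equivalent form $\bar{Y} - b_{IV}\bar{X}$.

```python
def durbin_rank_iv(x, y):
    """IV estimates (a, b) of Y = a + b X using the ranks of X as the instrument."""
    n = len(x)
    order = sorted(range(n), key=lambda t: x[t])
    rank = [0] * n
    for r, t in enumerate(order, start=1):
        rank[t] = r                          # rank 1 goes to the smallest X
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum(r * (yt - ybar) for r, yt in zip(rank, y))   # sum of i * y_i
    den = sum(r * (xt - xbar) for r, xt in zip(rank, x))   # sum of i * x_i
    b = num / den
    a = ybar - b * xbar                      # algebraically equal to Eq. (10-82)
    return a, b
```

For X = 1, 2, 4, 7 and Y = 2, 3, 9, 12 the sums are 18 and 10, so the slope is 1.8 and the intercept is 0.2.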
To illustrate ML methods, which depend on some specific prior knowledge of 
the disturbance variances, consider again the two-variable model 
$$Y_t = \alpha + \beta\tilde{X}_t + u_t \qquad t = 1, \ldots, n \tag{10-83}$$
with 
$$X_t = \tilde{X}_t + v_t$$
where X denotes the observed value and $\tilde{X}$ the true unobserved value. The u term 
is an amalgam of the conventional disturbance term and any measurement error 
in Y. Thus the model might be written equivalently as an exact relation between 
two variables, both subject to error, that is, 
$$\tilde{Y}_t = \alpha + \beta\tilde{X}_t \tag{10-84}$$
with 
$$Y_t = \tilde{Y}_t + u_t \quad\text{and}\quad X_t = \tilde{X}_t + v_t$$
The errors $u_t$ and $v_t$ are assumed to follow normal distributions with the following 
properties: 

$$\begin{aligned} E(u_t) = E(v_t) = 0 \qquad E(u_t^2) = \sigma_u^2 \qquad E(v_t^2) = \sigma_v^2 &\qquad \text{for all } t \\ E(u_s u_t) = E(v_s v_t) = 0 &\qquad s \ne t \\ E(u_s v_t) = 0 &\qquad \text{for all } s, t \end{aligned} \tag{10-85}$$


Thus the errors are taken to be serially and mutually independent. The relation 
between the errors and the true $\tilde{X}$, $\tilde{Y}$ values depends on the nature of these latter 
variables. We will distinguish two cases. 

Case 10-1. $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$ are a set of given numbers. This case has two possible 
interpretations. One is that the set of $\tilde{X}$’s can be held fixed in repeated sampling. 
This situation would be of little interest, even in the experimental sciences, for if 
the $\tilde{X}$’s are truly unobservable, how can the experimenter know that they have 
been held constant in repeated trials? The more useful interpretation, especially in 
the social sciences, is the one treating the $\tilde{X}$’s as fixed amounts for making 
inferences conditional on the set of $\tilde{X}$’s underlying the sample observations. 

Case 10-2. The $\tilde{X}$’s are random drawings from a normal distribution with mean $\mu$ 
and variance $\sigma^2$. This is hardly a plausible description of the generating mechanism 
of most economic variables, but this case leads to the simplest estimating 
equations and there are interesting parallels between the estimators in the two 
cases. 

If the $\tilde{X}$’s are fixed, then so are the $\tilde{Y}$’s, and the assumptions already made in 
Eq. (10-85) would ensure zero covariances between errors and true values. 
Specifically 
$$E(\tilde{X}_s u_t) = E(\tilde{X}_s v_t) = E(\tilde{Y}_s u_t) = E(\tilde{Y}_s v_t) = 0 \qquad \text{for all } s, t \tag{10-86}$$
If, however, the assumptions of Case 10-2 apply and the $\tilde{X}$’s and hence the $\tilde{Y}$’s 
are random variables, the conditions in Eq. (10-86) would constitute an additional 
set of assumptions. 


Estimation of Case 10-2. Given the assumptions listed above, the observed X, Y 
values would come from a bivariate normal distribution which is fully determined 
by the following five parameters: 
$$\begin{aligned} E(X) &= E(\tilde{X}) = \mu \\ E(Y) &= E(\tilde{Y}) = \alpha + \beta\mu \\ \operatorname{var}(X) &= \sigma^2 + \sigma_v^2 \\ \operatorname{var}(Y) &= \sigma_{\tilde{Y}}^2 + \sigma_u^2 = \beta^2\sigma^2 + \sigma_u^2 \\ \operatorname{cov}(X, Y) &= \operatorname{cov}(\tilde{X}, \tilde{Y}) = \beta\sigma^2 \end{aligned} \tag{10-87}$$

The ML estimates of the parameters on the left-hand side of Eqs. (10-87) are 
given by the corresponding sample statistics, and we then hope to solve the 
resultant equations for estimates of the parameters of the model. The estimating 
equations for $\hat\alpha, \hat\beta, \ldots$ are 

$$\begin{aligned} \bar{X} &= \hat\mu \\ \bar{Y} &= \hat\alpha + \hat\beta\hat\mu \\ m_{xx} &= \hat\sigma^2 + \hat\sigma_v^2 \\ m_{yy} &= \hat\beta^2\hat\sigma^2 + \hat\sigma_u^2 \\ m_{xy} &= \hat\beta\hat\sigma^2 \end{aligned} \tag{10-88}$$

where the m’s indicate second-order moments of the sample data, that is, 

$$m_{xy} = \frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})(Y_t - \bar{Y})$$
and so on. The dilemma with Eqs. (10-88) is that there are six unknowns but only 
five equations. Only $\mu$ is identifiable and estimable. There is no hope of estimating 
the other parameters unless additional information can be brought to bear. Three 
possible sources of additional information are conventionally considered. 


1. Knowledge of $\sigma_v^2$. It is becoming more common for economic statisticians to 
indicate the approximate degree of error in major statistical series. Thus in 
some circumstances it may be possible to gauge the probable error in the 
explanatory variable and to replace $\sigma_v^2$ by an estimate $s_v^2$. The third and fifth 
equations in Eq. (10-88) then give 

$$\hat\beta = \frac{m_{xy}}{m_{xx} - s_v^2} \tag{10-89}$$

Thus the sample variance in X is reduced by the estimated error variance 
before dividing into the covariance term. If there were zero measurement 
error in X, Eq. (10-89) reduces to the slope of the OLS regression of Y on X. 
The first and second equations in Eqs. (10-88) give 

$$\hat\alpha = \bar{Y} - \hat\beta\bar{X} \tag{10-90}$$


2. Knowledge of $\sigma_u^2$. This is perhaps a less likely situation than prior knowledge 
of $\sigma_v^2$ since $\sigma_u^2$ incorporates both the measurement error in Y and also the 
conventional equation error. If, however, we have a prior estimate $s_u^2$, the 
fourth and fifth equations in Eq. (10-88) yield 

$$\hat\beta = \frac{m_{yy} - s_u^2}{m_{xy}} \tag{10-91}$$

If $s_u^2$ were zero, this estimate becomes the reciprocal of the slope in the OLS 
regression of X on Y. 

3. Knowledge of the ratio $\lambda = \sigma_u^2/\sigma_v^2$. After some manipulation the last three 
equations of Eq. (10-88) now give 

$$m_{xy}\hat\beta^2 - (m_{yy} - \lambda m_{xx})\hat\beta - \lambda m_{xy} = 0 \tag{10-92}$$
with roots 

$$\hat\beta = \frac{(m_{yy} - \lambda m_{xx}) \pm \sqrt{(m_{yy} - \lambda m_{xx})^2 + 4\lambda m_{xy}^2}}{2m_{xy}} \tag{10-93}$$

The sign of $\hat\beta$ must be the same as that of $m_{xy}$. This will be so only if the 
numerator of Eq. (10-93) is positive, and that in turn will be so only if the 
positive sign before the square root is taken. Thus the estimator is 

$$\hat\beta = \frac{(m_{yy} - \lambda m_{xx}) + \sqrt{(m_{yy} - \lambda m_{xx})^2 + 4\lambda m_{xy}^2}}{2m_{xy}} \tag{10-94}$$
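
The estimator for a known variance ratio is a one-line computation once the sample moments are in hand. The Python sketch below is illustrative; the function name `beta_known_lambda` is chosen here, and the ratio is taken as the equation-error variance divided by the X measurement-error variance, which is the definition that reproduces the quadratic in Eq. (10-92).

```python
import math

def beta_known_lambda(mxx, myy, mxy, lam):
    """Slope estimate from Eq. (10-94): the root of Eq. (10-92) taking the positive
    sign before the square root. lam is the assumed known ratio of error variances."""
    d = myy - lam * mxx
    return (d + math.sqrt(d * d + 4.0 * lam * mxy * mxy)) / (2.0 * mxy)
```

As a check on the algebra, moments of 5, 19, and 8 for m_xx, m_yy, and m_xy with a ratio of 3 give a discriminant of 784 and a slope of exactly 2, which lies between the slope of the regression of Y on X (8/5) and the reciprocal slope of the regression of X on Y (19/8).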


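Estimation of Case 10-1. We now assume that there is a set of unknown values 
$\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$ underlying the sample data, and we wish to make inferences 
conditional on this set. We still retain assumptions (10-84) and (10-85). The log 
likelihood function is 

$$\ln L = \text{constant} - \frac{n}{2}\ln\sigma_u^2 - \frac{n}{2}\ln\sigma_v^2 - \frac{1}{2\sigma_v^2}\sum_{t=1}^{n}(X_t - \tilde{X}_t)^2 - \frac{1}{2\sigma_u^2}\sum_{t=1}^{n}(Y_t - \alpha - \beta\tilde{X}_t)^2 \tag{10-95}$$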


The major difficulty is that the likelihood function now contains $n + 4$ parameters, namely, $\alpha$, $\beta$, the two error variances, and the $n$ values of X. Straightforward maximization of Eq. (10-95) leads to unacceptable results.† The situation cannot be rescued by increasing the sample size since this automatically increases the number of unknown X's. It can, however, be improved by the use of prior knowledge, typically that the variance ratio $\lambda$ is known. Making this substitution in the log likelihood and carrying through the maximization process gives exactly the same quadratic in $\beta$ as already derived in Eq. (10-92) for Case 10-2. Thus the $\hat{\beta}$ defined in Eq. (10-94) is the ML estimator for Case 10-1.

The range of $\lambda$ is zero to infinity. The extremes correspond to the two simple regressions in Case 10-2, information 1 and 2. The estimator defined in Eq. (10-94), which is based on a known $\lambda$, will lie between the two OLS regression lines. This estimator is a consistent estimator of $\beta$. Kendall and Stuart show that
a consistent estimator of a, is provided by 


2n aN 


=i ae — 2Bm,, + B?m,,) (10-96) 


The hypothesis $H_0\colon \beta = 0$ may be tested by computing the sample correlation coefficient

$$r = \frac{m_{xy}}{\sqrt{m_{xx}m_{yy}}}$$

and using the result that, under the null hypothesis,

$$\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2)$$


The computation of a confidence interval for $\beta$ is somewhat more complicated.§ Define the angle $\theta$ by $\beta = \tan\theta$, or $\theta = \arctan\beta$. The 95 percent confidence interval for $\theta$ is given by


r= 


2 W/2 
ek a: (10-97) 
(n= 2)[(m..—m,,)? + 4m, 


a losil i 
O+ yaresin 2to.025 


The corresponding limits for $\beta$ are the tangents of these angles. The assumptions required for the development of Eq. (10-97) render this essentially a large sample method, and, of course, all the above rests on exact knowledge of $\lambda$, which is not often likely to be forthcoming. The technique may be extended to a multivariate regression if the investigator has knowledge of the ratios of all the error variances. Details are given in the Kendall and Stuart treatise.


† See M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 2, Griffin, London, 1961, pp. 383 ff.

‡ M. G. Kendall and A. Stuart, op. cit., pp. 385-386.

§ M. G. Kendall and A. Stuart, op. cit., pp. 388-391.

10-1 Prove Eq. (10-5) by the method suggested in the text. [Hint: Remember that expressions such as $x_t'(X_{t-1}'X_{t-1})^{-1}x_t$ are scalars and may be moved back and forth in matrix formulas, that is, $cAB = AcB = ABc$, where $c$ is a scalar and $A$ and $B$ are matrices.]

10-2 Relation (10-5) is a special case of a general result given by Plackett.† His problem and method of proof may be stated as follows:

First sample data: $y_1$, $X_1$ $(m \times k)$

Additional data: $y_2$, $X_2$ $(m \times k)$

Complete sample: $y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$, $X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$

The problem is to find the simplest computational way of updating least-squares statistics from the first sample to the complete sample.

Method: Define

$$R_1 = X_2(X_1'X_1)^{-1}X_2' \qquad\text{and}\qquad R = X_2(X'X)^{-1}X_2'$$

Prove that

$$R_1R = R_1 - R$$

and hence that

$$(I_m + R_1)(I_m - R) = I_m$$

Then show that

$$(I_m + R_1)^{-1}X_2(X_1'X_1)^{-1} = X_2(X'X)^{-1}$$

and thus that

$$(X'X)^{-1} = (X_1'X_1)^{-1} - (X_1'X_1)^{-1}X_2'(I_m + R_1)^{-1}X_2(X_1'X_1)^{-1}$$

Finally show that this result yields Eq. (10-5) when $X_2$ is just a row vector of observations on one additional sample point.
10-3 For the recursive residuals defined in Sec. 10-1, prove 
RSS, = RSS,_, + 97 
[ Hint: Express y, — X,b, as y, — X,b,_; — X,(b, — b,_). Partition 
Y-1 X41 
% Pal and x,-[% 


and using Eq. (10-6) show that 


RSS, = (y= X,_-1)'(, = Keds) —¥(KEX,) XC = Xb)" 
Applying the partitioning again and using Eq. (10-5) gives the desired result.] 
10-4 Take a simple time series and verify that the restricted estimation of Eq. (10-17) yields the same point estimates of the $\alpha$ and $\beta$ parameters as those derived from the estimated coefficients of the spline function (10-14).


† R. L. Plackett, “Some Theorems in Least Squares,” Biometrika, vol. 37, 1950, pp. 149-157.
10-5 For the disturbance term in Eq. (10-21) make the following assumptions:
E(uj,) = 9, 
E(u uy) = %; 
We = Piri tie [Pil <1 


iid(0, 0?) 


fir 
Derive var(u) and discuss how a feasible GLS estimator of the parameters of Eq. (10-21) might be 
constructed, 
10-6 Show that in Table 10-2 


{EE Hy} - 


p(m—\) at 


and hence show that it is a biased estimator of 07 = 07 + 07. 
10-7 For a two-way error component model assume 
Pee ee Ve a on ey Ad Ea 
where 1, is a unit-specific time-invariant effect, A, is a period-specific unit-invariant effect, and, ¢,, is a 


random disturbance at observation i, f. 
The y,, A,, and e,, are random variables having zero means, independent among themselves and 


with each other, with variances a2. o}, and 02, respectively. Show that 


V = E(w’) = o?[pA + wB + (1 ~ p— w)I pm] 


where A=1,04J,, 
B=J,@1,, 
2 2 
4g, ° 
Pietra ita peat ets 
o=o2+ a+, p= w 
* ty e o a 


and J, is an m X m matrix of ones. 

10-8 For the disturbance u,, defined in Problem 10-7 develop the ANOVA table similar to Table 10-2. 
Hence indicate possible estimators of 97, 0%, and o,. 

10-9 Establish the result stated in Eq. (10-52). 

10-10 Prove Eq. (10-57). 

10-11 Derive formulas (10-81) and (10-82). 

10-12 Consider the following regression model for a sample of panel data:

$$Y_{ij} = \alpha_0 + \alpha_1X_{1ij} + \alpha_2X_{2ij} + \alpha_3X_{3ij} + \varepsilon_{ij}$$

$i = 1, 2, \ldots, m$ (panel members), $j = 1, 2, \ldots, t$ (time periods), and the X's are exogenous variables. The $\varepsilon_{ij}$ are assumed to be normally and independently distributed with zero mean and constant variance for all $i$, $j$.
(a) If $X_{3ij}$ is not observed and an investigator regresses $Y_{ij}$ on just $X_{1ij}$ and $X_{2ij}$ with a constant term in the regression, what is the bias in the least-squares estimate of $\alpha_2$? If the algebraic sign of the simple correlation coefficient for $X_{3ij}$ and $X_{2ij}$ were known, is this sufficient information to determine the algebraic sign of the bias? If not, explain what information is required to determine the algebraic sign of the bias.
(b) If the unobserved independent variable $X_{3ij}$ is assumed to satisfy $X_{3ij} = X_{3i}$ for all $j$ and is assumed to be nonstochastic, explain how to obtain estimates of $\alpha_1$ and $\alpha_2$ and their associated standard errors.
(University of Chicago, 1977)
10-13 Consider the following errors in variables model:
yhaatoxt §=1,.-,N50= 1,2 


Xp = Xt bin 
Ee; = Ee, = E(xheis) = E(xneis) = Eves) = 9 
E(yireis) = E(eutis) = 9 
E(e) =o E(ex) = 92 E(e,€;2) = Po E(ejei2) = Poe 
E(x#?)=02  E(xAxh) = ey foralli;¢= 1,2;5 = rey) 


.,N; ¢= 1,2. Let b be the IV estimate of b from a cross-section 
as the instrument, Let b be the IV estimate 
p,, and p, are all 


Xiys Yu are observed for i = 1,.. 
regression using data from the second time period and x,; 
of 6 using the same cross section hut with y,, as the instrument. Show that if p,. 


positive, then plim(b) < 6 < plim(6). 
(UL, 1981) 


CHAPTER ELEVEN

SIMULTANEOUS EQUATION SYSTEMS


So far our interest has centered mainly on the inference problems associated with 
a single equation, although there was some discussion of groups of equations in 
Chap. 8. Economists, of course, often focus on a single equation, such as an 
aggregate consumption function, a demand function for gasoline, a wage-change 
equation, and so forth. However, economic theory teaches that such equations are 
embedded in a system or subset of related equations. Thus one must examine 
whether the presence of these related equations has any implications for the 
estimation of the focus equation. More importantly, the estimation of a complete 
system of equations is often an important practical problem, whether the objec- 
tive is to test economic theories about the nature of the system or to use the 
complete system to make joint predictions of a set of related variables. 


11-1 SOME ILLUSTRATIVE SIMULTANEOUS SYSTEMS 


In this section we will consider a few very simplified systems in order to illustrate the main problems that arise, and then in subsequent sections we will give a more general and formal treatment.

Consider first an even simpler income determination model than the one outlined in Chap. 1. This one consists solely of a consumption function and the national income identity, namely,

$$C_t = \alpha + \beta Y_t + u_t \tag{11-1}$$

$$Y_t = C_t + Z_t \tag{11-2}$$

where C = aggregate consumption expenditure
Y = national income
Z = nonconsumption expenditure
u = a stochastic disturbance term


We regard the model as explaining the values taken by $C_t$ and $Y_t$, conditional on $Z_t$. Thus C and Y are classified as endogenous variables and Z as an exogenous variable. We will make two assumptions, namely:

1. $u \sim N(0, \sigma^2 I)$

2. Z and u are independent, which will be satisfied if either Z is a set of fixed numbers or Z is a random variable distributed independently of u. Z could be taken as representing autonomous investment and government spending controlled by some central authority. The model does not discuss the determinants of Z.


The reduced form of the model is†

$$C_t = \frac{\alpha}{1-\beta} + \frac{\beta}{1-\beta}Z_t + v_t \tag{11-3}$$

$$Y_t = \frac{\alpha}{1-\beta} + \frac{1}{1-\beta}Z_t + v_t \tag{11-4}$$

where $v_t = u_t/(1-\beta)$, so that

$$\operatorname{var}(v_t) = \frac{\sigma^2}{(1-\beta)^2}$$

It is immediately obvious from Eq. (11-4) that $v_t$, and hence $u_t$, influences $Y_t$. In fact,

$$\operatorname{plim}\left(\frac{1}{n}\sum Y_tu_t\right) = \operatorname{plim}\left(\frac{1}{n}\sum v_tu_t\right) = \frac{\sigma^2}{1-\beta}$$

Thus the application of OLS to the consumption function (11-1) would yield inconsistent estimates.‡ The nature of the inconsistency is illustrated diagrammatically in Fig. 11-1. The line $\alpha + \beta Y$ shows the relation between C and Y if the disturbance u were zero. The line $Y - Z'$ illustrates the identity (11-2) for a specified $Z'$. The equilibrium of the system would then be indicated by the point P,. Imagine now that Z is held constant at $Z'$ and that the disturbance takes on various positive and negative values in some finite range.§ The economy would


† As shown in Chap. 1, the reduced form is obtained by solving the model so as to express each current endogenous variable solely in terms of exogenous variables and lagged endogenous variables.

‡ If necessary, review the discussion of consistency in Sec. 7-2 and illustrations of inconsistency in the presence of lagged variables in Sec. 9-2 and in the presence of errors of measurement in Sec. 10-6.

§ The range is, of course, infinite for a normally distributed disturbance, but the finite range is a convenient assumption to keep the diagram simple.

Figure 11-1 (diagram showing the line $\alpha + \beta Y$ and the identity line $Y - Z$)


then trace out points in successive periods in the range P, to P, along the Y ~ Z’ 
line. If Z never changed from Z’, these would be the only points ever observed for 
this economy, no matter how many observations were taken. The estimated 
regression of C on Y would coincide with the line $Y - Z'$, and the estimated
marginal propensity to consume would be unity, no matter what the true $\beta$ happened
to be. Now suppose that over a large number of time periods Z ranges between $Z'$
and Z”, Observations on C and Y would then fill in the parallelogram P,P, P; P,. 
The least-squares regression of C on Y minimizes the sum of squares of the 
residuals measured in the vertical (that is, C) direction. Thus in the limit the OLS 
line will tend to pass through the points P,, P;. The estimated slope will now be 
less than unity but will still be greater than the true $\beta$, so that the asymptotic bias is positive.†


Instrumental Variable Estimation 

We saw in Sec. 9-2 that the use of suitable instrumental variables can produce consistent estimators. The obvious instrument in the present model is the Z variable which, by assumption, is independent of u, and by Eq. (11-4) will be correlated with Y. Applying the IV estimator defined in Eq. (9-60) to this model gives

$$a_{IV} = \bar{C} - b_{IV}\bar{Y} \tag{11-5}$$

and

$$b_{IV} = \frac{\sum cz}{\sum yz} \tag{11-6}$$


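† See Problem 11-1.

where c, y, and z denote deviations from the sample means. From Eqs. (11-3) and (11-4), and using $u = (1-\beta)v$, we may derive

$$b_{IV} = \frac{\beta\sum z^2 + \sum zu}{\sum z^2 + \sum zu}$$

Thus, provided

$$\operatorname{plim}\left(\frac{1}{n}\sum zu\right) = 0 \qquad\text{and}\qquad \operatorname{plim}\left(\frac{1}{n}\sum z^2\right) = \text{a finite nonzero number,}$$

$$\operatorname{plim}(b_{IV}) = \beta$$

and hence

$$\operatorname{plim}(a_{IV}) = \alpha$$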


Indirect Least-Squares Estimation 


The above development already contains a clue to a second estimation principle, that of indirect least squares (ILS). Looking at the reduced-form equations it is clear that they satisfy the assumptions under which OLS estimators are consistent (and indeed best linear unbiased) so that

$\dfrac{\sum cz}{\sum z^2}$ is a consistent estimator of $\dfrac{\beta}{1-\beta}$

and $\dfrac{\sum yz}{\sum z^2}$ is a consistent estimator of $\dfrac{1}{1-\beta}$

which suggests taking the ratio

$$b_{ILS} = \frac{\sum cz/\sum z^2}{\sum yz/\sum z^2} = \frac{\sum cz}{\sum yz}$$

as an estimate of $\beta$.

The principle of ILS is to estimate reduced-form coefficients by OLS and then to compute structural coefficients by an appropriate transformation of the estimated reduced-form coefficients. We see immediately that in this case

$$b_{ILS} = b_{IV}$$


Two-Stage Least-Squares Estimation 


A third estimation principle is that of two-stage least squares (2SLS). It starts from the problem of $Y_t$ and $u_t$ in Eq. (11-1) being correlated. The first stage is to regress Y on the exogenous variables in the model, which in this case are $Z_t$ and a dummy variable that is always unity to allow for the intercept term. This reduced-form regression yields an estimated $\hat{Y}$ series, which it is hoped will display less correlation with the u series than does the original Y series. We may write Eq. (11-4) in deviation form as

$$y_t = \delta z_t + v_t$$

where $\delta = 1/(1-\beta)$, and we have also omitted $\bar{v}$ as it does not affect the subsequent derivation. The regression values are then given by

$$\hat{y}_t = \hat{\delta}z_t \qquad\text{where}\qquad \hat{\delta} = \frac{\sum yz}{\sum z^2} = \delta + \frac{\sum zv}{\sum z^2}$$

Thus

$$\sum\hat{y}u = \delta\sum zu + \frac{\sum zv\cdot\sum zu}{\sum z^2}$$

On the assumptions made earlier,

$$\operatorname{plim}\left(\frac{1}{n}\sum zu\right) = \operatorname{plim}\left(\frac{1}{n}\sum zv\right) = 0$$

so that in the limit $\hat{y}$ is uncorrelated with u. In the second stage C is regressed on $\hat{Y}$ to estimate $\alpha$ and $\beta$, that is, Eq. (11-1) is reformulated as

$$C_t = \alpha + \beta\hat{Y}_t + [u_t + \beta(Y_t - \hat{Y}_t)]$$

with $C_t$ as the dependent variable and $\hat{Y}_t$ as the explanatory variable. The disturbance term is shown in square brackets. From the OLS regression of Y on Z it follows that $\hat{Y}_t$ will have zero correlation in the sample with the residual $Y_t - \hat{Y}_t$, and we have just shown that $\hat{Y}_t$ is uncorrelated in the limit with $u_t$. Thus $\hat{Y}_t$ is uncorrelated in the limit with the combined disturbance term $[u_t + \beta(Y_t - \hat{Y}_t)]$, and the 2SLS estimators will be consistent. The 2SLS estimate of the slope $\beta$ is

$$b_{2SLS} = \frac{\sum c\hat{y}}{\sum\hat{y}^2} = \frac{\hat{\delta}\sum cz}{\hat{\delta}^2\sum z^2} = \frac{\sum cz}{\sum z^2}\cdot\frac{\sum z^2}{\sum yz} = \frac{\sum cz}{\sum yz}$$

Thus we see that in this case all three principles of estimation, IV, ILS, and 2SLS, would yield identical consistent estimates.
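The equality of the IV, ILS, and 2SLS slopes, and the upward asymptotic bias of OLS, can be seen in a small simulation of the model (11-1) and (11-2). This is only a sketch: the parameter values ($\alpha = 10$, $\beta = 0.6$), the uniform distribution chosen for Z, and all function names are assumptions made for illustration.

```python
import random


def simulate(alpha=10.0, beta=0.6, n=5000, seed=0):
    """Generate (C, Y, Z) from C = alpha + beta*Y + u and Y = C + Z."""
    rng = random.Random(seed)
    Z = [rng.uniform(20.0, 40.0) for _ in range(n)]   # exogenous expenditure
    u = [rng.gauss(0.0, 2.0) for _ in range(n)]       # structural disturbance
    # Reduced form (11-4): Y = (alpha + Z + u) / (1 - beta)
    Y = [(alpha + z + e) / (1.0 - beta) for z, e in zip(Z, u)]
    C = [y - z for y, z in zip(Y, Z)]                 # identity (11-2)
    return C, Y, Z


def dev(v):
    """Deviations from the sample mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def sxy(a, b):
    """Sum of cross products."""
    return sum(p * q for p, q in zip(a, b))


C, Y, Z = simulate()
c, y, z = dev(C), dev(Y), dev(Z)

b_ols = sxy(c, y) / sxy(y, y)                             # OLS on (11-1)
b_iv = sxy(c, z) / sxy(y, z)                              # Eq. (11-6)
b_ils = (sxy(c, z) / sxy(z, z)) / (sxy(y, z) / sxy(z, z)) # ratio of reduced-form slopes
d_hat = sxy(y, z) / sxy(z, z)                             # first-stage slope
y_hat = [d_hat * a for a in z]                            # first-stage fitted values
b_2sls = sxy(c, y_hat) / sxy(y_hat, y_hat)                # second-stage slope

print(b_ols, b_iv, b_ils, b_2sls)
```

The three consistent estimates agree up to rounding error, while the OLS slope lies above them and below unity.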

The two-equation model of Eqs. (11-1) and (11-2) is the simplest possible simultaneous equation model, consisting of just one stochastic behavioral equation and an identity, but that is enough to generate a dependence between the explanatory variable and the disturbance in the structural relation, rendering OLS inconsistent. More complicated models may be expected to generate further problems in addition to those already encountered.

Consider next a two-equation model in which both equations are stochastic behavioral relations. With a slight change of notation we write

$$\begin{aligned} y_{1t} + \beta_{12}y_{2t} + \gamma_{11} &= u_{1t} \\ \beta_{21}y_{1t} + y_{2t} + \gamma_{21} &= u_{2t} \end{aligned} \tag{11-7}$$

In this and subsequent models lowercase letters denote the actual values of the variables and not deviations from sample means. We will reserve the letter y for endogenous variables so that $y_{it}$ denotes the tth observation on the ith endogenous variable. Likewise $x_{jt}$ will denote the tth observation on the jth exogenous variable. The structural parameters $\beta$ and $\gamma$ also have two subscripts, the first indicating the equation and the second the variable to which it is attached.

Model (11-7) would be a conventional demand-and-supply model if $y_1$ denotes price, $y_2$ denotes quantity, and we impose the restrictions

$$\beta_{12} > 0 \qquad \beta_{21} < 0$$

so that the first equation represents a downward sloping demand curve and the second an upward sloping supply curve. We would also want to impose an additional restriction $\gamma_{11} < 0$ to ensure a positive intercept for the demand function. If the disturbances in period t were both zero ($u_{1t} = 0 = u_{2t}$), the model would be represented by the D, S lines in Fig. 11-2, and we would observe the equilibrium price and quantity indicated by $y_1^*$, $y_2^*$. Nonzero disturbances shift the D, S curves up or down from the position shown in Fig. 11-2. Thus a set of random disturbances would generate a two-dimensional scatter of observations clustered around the $y_1^*$, $y_2^*$ point.

A fundamentally new problem now arises. Given this two-dimensional scatter 
in price-quantity space, demand analysts might fit a regression and think they 
were estimating a demand function. Supply analysts might fit a regression to the 
same data and presume they were estimating a supply function. “General 
equilibrium” economists, wishing to estimate both functions, would presumably 
be halted on their way to the computer by the thought, “How can we estimate 
two separate functions from one two-dimensional scatter?” The new problem is 
labeled the identification problem. It is concerned with the question of whether any 
specific equation in a model can in fact be estimated. It is not a question of the 
method of estimation nor of sample size, but of whether meaningful estimates of 
structural coefficients can be obtained. On the assumptions made so far neither equation in Eqs. (11-7) is identified. A regression fitted to the scatter in $y_1$, $y_2$ space is not an estimate of either the demand or the supply function.


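Figure 11-2 (lines D: $y_1 + \beta_{12}y_2 + \gamma_{11} = 0$ and S: $\beta_{21}y_1 + y_2 + \gamma_{21} = 0$)

The identification problem may be investigated by looking at the relation between the structural and the reduced forms of the model. The reduced-form equations corresponding to Eq. (11-7) are†

$$\begin{aligned} y_{1t} &= \frac{1}{\Delta}\left[(-\gamma_{11} + \beta_{12}\gamma_{21}) + (u_{1t} - \beta_{12}u_{2t})\right] \\ y_{2t} &= \frac{1}{\Delta}\left[(\beta_{21}\gamma_{11} - \gamma_{21}) + (-\beta_{21}u_{1t} + u_{2t})\right] \end{aligned} \tag{11-8}$$

where $\Delta = 1 - \beta_{12}\beta_{21}$. The first term on the right-hand side of each equation is a constant. Thus we may write the reduced form more simply as

$$\begin{aligned} y_{1t} &= \mu_1 + v_{1t} \\ y_{2t} &= \mu_2 + v_{2t} \end{aligned} \tag{11-9}$$

where

$$\begin{aligned} \mu_1 &= \frac{-\gamma_{11} + \beta_{12}\gamma_{21}}{\Delta} & \mu_2 &= \frac{\beta_{21}\gamma_{11} - \gamma_{21}}{\Delta} \\ v_{1t} &= \frac{u_{1t} - \beta_{12}u_{2t}}{\Delta} & v_{2t} &= \frac{-\beta_{21}u_{1t} + u_{2t}}{\Delta} \end{aligned} \tag{11-10}$$

If we postulate that

$$E(u_t) = E\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix} = 0 \qquad\text{and}\qquad E(u_tu_t') = \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}$$

then

$$E(v_t) = 0$$

$$\operatorname{var}(v_1) = E(v_{1t}^2) = \frac{\sigma_{11} + \beta_{12}^2\sigma_{22} - 2\beta_{12}\sigma_{12}}{\Delta^2}$$

$$\operatorname{var}(v_2) = E(v_{2t}^2) = \frac{\beta_{21}^2\sigma_{11} + \sigma_{22} - 2\beta_{21}\sigma_{12}}{\Delta^2}$$

and

$$\operatorname{cov}(v_1, v_2) = E(v_{1t}v_{2t}) = \frac{-\beta_{21}\sigma_{11} - \beta_{12}\sigma_{22} + (1 + \beta_{12}\beta_{21})\sigma_{12}}{\Delta^2}$$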


† Here there are no lagged endogenous variables and the only exogenous variable is the dummy variable $x_{1t} = 1$ for all t, which is required to take care of the intercept term in the structural equations.

It also follows from Eqs. (11-9) that

$$\begin{aligned} E(y_1) &= \mu_1 \\ E(y_2) &= \mu_2 \\ \operatorname{var}(y_1) &= \operatorname{var}(v_1) \\ \operatorname{var}(y_2) &= \operatorname{var}(v_2) \\ \operatorname{cov}(y_1, y_2) &= \operatorname{cov}(v_1, v_2) \end{aligned} \tag{11-11}$$

Sample data on $y_1$, $y_2$ can only yield estimates of the five parameters in Eqs. (11-11). These in turn are functions of the seven parameters of the structural model, namely, $\beta_{12}$, $\beta_{21}$, $\gamma_{11}$, $\gamma_{21}$, $\sigma_{11}$, $\sigma_{22}$, and $\sigma_{12}$. On the assumptions made so far the structural parameters are unidentifiable.

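As a numerical illustration of this situation suppose the true structure corresponding to Eqs. (11-7) is

$$\begin{aligned} y_1 + 2y_2 - 10 &= u_1 \\ -3y_1 + y_2 + 2 &= u_2 \\ \sigma_{11} = \sigma_{22} = 1 \qquad \sigma_{12} &= 0.5 \end{aligned} \tag{11-12}$$

Equations (11-7) define a model, and a structure like Eqs. (11-12) is obtained from a model by assigning specific numerical values to the $\beta$ and $\gamma$ parameters and also to the variances and the covariance of the u's. Solving this structure for Eqs. (11-9) gives

$$y_1 = 2 + v_1 \qquad y_2 = 4 + v_2$$

where

$$v_1 = \frac{u_1 - 2u_2}{7} \qquad v_2 = \frac{3u_1 + u_2}{7}$$

Thus

$$\begin{aligned} E(y_1) &= \mu_1 = 2 \\ E(y_2) &= \mu_2 = 4 \\ \operatorname{var}(y_1) &= \operatorname{var}(v_1) = \frac{3}{49} \\ \operatorname{var}(y_2) &= \operatorname{var}(v_2) = \frac{13}{49} \\ \operatorname{cov}(y_1, y_2) &= \operatorname{cov}(v_1, v_2) = \frac{-1.5}{49} \end{aligned} \tag{11-13}$$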

The true structure (11-12) is, of course, known only to the “deity” who sets the economic system in motion. Now suppose that one of the deity’s vice-presidents tinkers with the institutions in an attempt to confuse the econometricians of the world and concocts a new structure by the following rule, where (1) and (2) indicate the first and second equations in Eqs. (11-12),

New first equation = 4(1) + 1(2)

New second equation = -(1) + 3(2)

This yields the structure

$$\begin{aligned} y_1 + 9y_2 - 38 &= u_1^* \\ -10y_1 + y_2 + 16 &= u_2^* \end{aligned} \tag{11-14}$$

where

$$u_1^* = 4u_1 + u_2 \qquad u_2^* = -u_1 + 3u_2$$

The new structure obeys the same a priori constraints on signs as Eqs. (11-12). Solving this structure for Eqs. (11-9) gives

$$y_1 = 2 + v_1^* \qquad y_2 = 4 + v_2^*$$

where

$$v_1^* = \frac{u_1^* - 9u_2^*}{91} = \frac{u_1 - 2u_2}{7} = v_1 \qquad v_2^* = \frac{10u_1^* + u_2^*}{91} = \frac{3u_1 + u_2}{7} = v_2$$

Thus the five parameters of the reduced form $E(y_1)$, $E(y_2)$, $\operatorname{var}(y_1)$, $\operatorname{var}(y_2)$, and $\operatorname{cov}(y_1, y_2)$ are identical for the two different structures and indeed for all structures derived by taking linear combinations of the original structural equations.
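The observational equivalence of structures (11-12) and (11-14) can be checked with exact rational arithmetic. The sketch below applies formulas (11-10) and the variance expressions for $v_1$, $v_2$ to both structures; the function names are illustrative assumptions, and the disturbance moments of the new structure are computed from the weights 4, 1 and -1, 3 used in the text.

```python
from fractions import Fraction as F


def reduced_form(b12, b21, g11, g21, s11, s22, s12):
    """Means, variances, and covariance of (y1, y2) implied by a structure of form (11-7)."""
    d = 1 - b12 * b21
    mu1 = (-g11 + b12 * g21) / d
    mu2 = (b21 * g11 - g21) / d
    var1 = (s11 + b12 ** 2 * s22 - 2 * b12 * s12) / d ** 2
    var2 = (b21 ** 2 * s11 + s22 - 2 * b21 * s12) / d ** 2
    cov = (-b21 * s11 - b12 * s22 + (1 + b12 * b21) * s12) / d ** 2
    return mu1, mu2, var1, var2, cov


def combine(s11, s22, s12, a, b, c, d):
    """Moments of u1* = a*u1 + b*u2 and u2* = c*u1 + d*u2."""
    n11 = a * a * s11 + b * b * s22 + 2 * a * b * s12
    n22 = c * c * s11 + d * d * s22 + 2 * c * d * s12
    n12 = a * c * s11 + b * d * s22 + (a * d + b * c) * s12
    return n11, n22, n12


s11, s22, s12 = F(1), F(1), F(1, 2)

# Structure (11-12): y1 + 2*y2 - 10 = u1, -3*y1 + y2 + 2 = u2
original = reduced_form(F(2), F(-3), F(-10), F(2), s11, s22, s12)

# Structure (11-14): y1 + 9*y2 - 38 = u1*, -10*y1 + y2 + 16 = u2*
n11, n22, n12 = combine(s11, s22, s12, 4, 1, -1, 3)
concocted = reduced_form(F(9), F(-10), F(-38), F(16), n11, n22, n12)

print(original)
print(concocted)
```

Both calls return the same five numbers, namely those listed in Eqs. (11-13), so no amount of data on $y_1$, $y_2$ can distinguish the two structures.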
It is instructive to see what type of further information might help identify one or both equations of this model. There are three basic possibilities, namely, (1) restrictions on the $\beta$ and $\gamma$ parameters, (2) restrictions on the $\Sigma$ matrix, and (3) respecifications of the model to incorporate additional variables. To illustrate the first category, suppose the supply function is presumed to go through the origin. The a priori restriction is thus

$$\gamma_{21} = 0$$

This reduces the number of structural parameters to six, but the number of reduced-form parameters is five, as before, so that it is still not clear that any structural parameters can be identified. However, making the substitution $\gamma_{21} = 0$ in Eqs. (11-10) gives

$$\mu_1 = \frac{-\gamma_{11}}{\Delta} \qquad \mu_2 = \frac{\beta_{21}\gamma_{11}}{\Delta}$$

so that

$$\beta_{21} = -\frac{\mu_2}{\mu_1} = -\frac{E(y_2)}{E(y_1)}$$

showing that $\beta_{21}$ can be determined from a knowledge of the reduced-form parameters and also suggesting a possible estimator as $\hat{\beta}_{21} = -\bar{y}_2/\bar{y}_1$. This restriction would enable the supply function to be identified, but the demand equation remains unidentified. Linear combinations of the demand and supply equations would be statistically indistinguishable from the original demand equation. However, any linear combination that assigns a nonzero weight to the demand function will fail, with probability 1, to have a zero intercept and thus will not look like the new supply function.
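The suggested estimator $-\bar{y}_2/\bar{y}_1$ can be tried on simulated data. The following sketch assumes hypothetical values $\beta_{12} = 2$, $\beta_{21} = -3$, $\gamma_{11} = -10$, $\gamma_{21} = 0$ and independent standard normal disturbances; none of these choices, nor the function name, comes from the text.

```python
import random


def simulate_means(b12=2.0, b21=-3.0, g11=-10.0, n=20000, seed=1):
    """Sample means of y1, y2 from model (11-7) with gamma_21 = 0."""
    rng = random.Random(seed)
    delta = 1.0 - b12 * b21
    s1 = s2 = 0.0
    for _ in range(n):
        u1 = rng.gauss(0.0, 1.0)
        u2 = rng.gauss(0.0, 1.0)
        # Reduced form (11-8) with gamma_21 set to zero.
        y1 = ((u1 - g11) - b12 * u2) / delta
        y2 = (u2 - b21 * (u1 - g11)) / delta
        s1 += y1
        s2 += y2
    return s1 / n, s2 / n


ybar1, ybar2 = simulate_means()
b21_hat = -ybar2 / ybar1   # estimator suggested by beta_21 = -mu_2 / mu_1

print(ybar1, ybar2, b21_hat)
```

The ratio of sample means recovers the supply slope closely, while nothing comparable is available for the demand slope under this restriction alone.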
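Now suppose we return to Eqs. (11-7) and impose the restriction

$$\operatorname{var}(u_1) = \sigma_{11} = 0$$

This also implies that $\sigma_{12} = 0$. Looking at Eqs. (11-11) we now find

$$\operatorname{var}(y_1) = \frac{\beta_{12}^2\sigma_{22}}{\Delta^2} \qquad \operatorname{var}(y_2) = \frac{\sigma_{22}}{\Delta^2} \qquad \operatorname{cov}(y_1, y_2) = \frac{-\beta_{12}\sigma_{22}}{\Delta^2}$$

so that

$$\beta_{12}^2 = \frac{\operatorname{var}(y_1)}{\operatorname{var}(y_2)} \qquad \beta_{12} = \frac{-\operatorname{cov}(y_1, y_2)}{\operatorname{var}(y_2)} = \frac{-\operatorname{var}(y_1)}{\operatorname{cov}(y_1, y_2)}$$

and thus the slope of the demand function is identified. Taking expectations of the demand function in Eqs. (11-7) gives

$$\gamma_{11} = -\mu_1 - \beta_{12}\mu_2$$

and substitution for $\mu_1$ and $\mu_2$ from Eqs. (11-10) verifies that this relation holds. Thus $\gamma_{11}$ and $\beta_{12}$ can both be expressed in terms of the parameters in Eqs. (11-11), and the demand equation is identified. This case is pictured in Fig. 11-3. The combination of $\sigma_{11} = 0$ and $\sigma_{22} \neq 0$ generates a set of observations on the demand function.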

A less extreme version of this case would occur if $\sigma_{11}$ were “small” as compared with $\sigma_{22}$. The scatter of observations would then tend to be concentrated around the demand function rather than lying exactly on it. However, knowledge about the relative sizes of disturbance variances is not likely to be generally available, though a possible reason for a large $\sigma_{22}$ might be the omission of important explanatory variables from the supply function in Eqs. (11-7). The appropriate remedy is the respecification of the supply function to include such variables. In practice the demand function should also be looked at since the simple two-variable model of Eqs. (11-7) is hardly a realistic specification with which to commence empirical work.

Figure 11-3


Consider now a respecification of Eqs. (11-7) which is, say,

$$\begin{aligned} y_1 + \beta_{12}y_2 + \gamma_{11}x_1 + \gamma_{12}x_2 &= u_1 \\ \beta_{21}y_1 + y_2 + \gamma_{21}x_1 + \gamma_{23}x_3 + \gamma_{24}x_4 &= u_2 \end{aligned} \tag{11-15}$$

where we still retain the restrictions $\beta_{12} > 0$ and $\beta_{21} < 0$ to conform with the demand-and-supply analogy. The variable $x_1$ could be taken as a dummy with a value of unity in all periods to cater for the intercept term, $x_2$ might represent income, which is expected to influence demand, and $x_3$ and $x_4$ would represent variables influencing supply. The reduced form of this model is

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \frac{1}{\Delta}\begin{bmatrix} (-\gamma_{11} + \beta_{12}\gamma_{21}) & -\gamma_{12} & \beta_{12}\gamma_{23} & \beta_{12}\gamma_{24} \\ (\beta_{21}\gamma_{11} - \gamma_{21}) & \beta_{21}\gamma_{12} & -\gamma_{23} & -\gamma_{24} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} + \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$$

where $\Delta = 1 - \beta_{12}\beta_{21}$ and the v's are given in Eqs. (11-10). Let us denote the reduced-form coefficients by $\pi_{ij}$ ($i = 1, 2$; $j = 1, 2, 3, 4$). It is clear that the structural coefficients can be obtained from the reduced-form coefficients. For example,

$$\beta_{21} = -\frac{\pi_{22}}{\pi_{12}}$$

$$\beta_{12} = -\frac{\pi_{13}}{\pi_{23}} = -\frac{\pi_{14}}{\pi_{24}}$$

and having found the $\beta$'s, the $\gamma$'s can be obtained from the $\pi$'s. Leaving the disturbance parameters aside, there are eight reduced-form coefficients and just seven structural coefficients. The imbalance is reflected in the existence of two alternative (but equivalent) expressions for $\beta_{12}$. This indicates, however, that we may expect the ILS technique to run into trouble here since the estimated reduced-form coefficients will in general not satisfy the equality $\pi_{13}/\pi_{23} = \pi_{14}/\pi_{24}$ that holds for the true coefficients.

Further investigation of identification and estimation problems by way of specific models of increasing size and complexity would be inefficient. We now move to a more general and more formal treatment, which can then be specialized to deal with particular cases.
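Before leaving the illustrative models, the difficulty ILS faces in model (11-15) can be made concrete. The sketch below builds the reduced-form coefficients from hypothetical structural values (chosen only for illustration), confirms that both ratios give the same $\beta_{12}$, and then perturbs a single coefficient, much as sampling error in an unrestricted OLS estimate would, to show the two ratios part company.

```python
from fractions import Fraction as F


def reduced_form_coeffs(b12, b21, g11, g12, g21, g23, g24):
    """Rows of the reduced-form coefficient matrix implied by model (11-15)."""
    d = 1 - b12 * b21
    row1 = [(-g11 + b12 * g21) / d, -g12 / d, b12 * g23 / d, b12 * g24 / d]
    row2 = [(b21 * g11 - g21) / d, b21 * g12 / d, -g23 / d, -g24 / d]
    return row1, row2


# Hypothetical structural values (assumed for illustration only).
pi1, pi2 = reduced_form_coeffs(F(2), F(-3), F(-10), F(-1), F(2), F(1), F(-2))

b21_from_pi = -pi2[1] / pi1[1]   # -pi_22 / pi_12
b12_a = -pi1[2] / pi2[2]         # -pi_13 / pi_23
b12_b = -pi1[3] / pi2[3]         # -pi_14 / pi_24

# Mimic an unrestricted estimate: disturb pi_13 slightly.
pi1_hat = list(pi1)
pi1_hat[2] += F(1, 100)
b12_a_hat = -pi1_hat[2] / pi2[2]
b12_b_hat = -pi1_hat[3] / pi2[3]

print(b21_from_pi, b12_a, b12_b)
print(b12_a_hat, b12_b_hat)
```

With the true coefficients both ratios return the same value of $\beta_{12}$, but after the perturbation they disagree, so ILS no longer delivers a unique estimate.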


11-2 THE IDENTIFICATION PROBLEM 


Let us assume a linear model containing G structural relations. The ith relation at time t may be written

$$\beta_{i1}y_{1t} + \cdots + \beta_{iG}y_{Gt} + \gamma_{i1}x_{1t} + \cdots + \gamma_{iK}x_{Kt} = u_{it} \qquad i = 1, \ldots, G;\; t = 1, \ldots, n \tag{11-16}$$

where the $y_{it}$ denote endogenous variables at time t, and the $x_{it}$ indicate exogenous variables (current or lagged) and may also include lagged endogenous variables.† The latter two groups constitute the class of predetermined variables. The model may then be regarded as a theory explaining the determination of the G jointly dependent variables $y_{it}$ ($i = 1, \ldots, G$; $t = 1, \ldots, n$) in terms of the predetermined variables $x_{it}$ ($i = 1, \ldots, K$; $t = 1, \ldots, n$) and the disturbances $u_{it}$ ($i = 1, \ldots, G$; $t = 1, \ldots, n$). The underlying theory will in general specify that some of the $\beta$, $\gamma$ coefficients are zero. If it did not, all the equations in the model would look alike statistically, as in Eqs. (11-7), and no equation could be identified. As mentioned earlier, the lowercase letters denote actual values of the variables and not deviations from arithmetic means, and setting one of the x variables at unity caters for a constant term in any equation that requires it.
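The model may be written in matrix form as

$$By_t + \Gamma x_t = u_t \qquad t = 1, \ldots, n \tag{11-17}$$

where B is a $G \times G$ matrix of coefficients of current endogenous variables, $\Gamma$ is a $G \times K$ matrix of coefficients of predetermined variables, and $y_t$, $x_t$, and $u_t$ are column vectors of G, K, and G elements, respectively,

$$B = \begin{bmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1G} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2G} \\ \vdots & \vdots & & \vdots \\ \beta_{G1} & \beta_{G2} & \cdots & \beta_{GG} \end{bmatrix} \qquad \Gamma = \begin{bmatrix} \gamma_{11} & \gamma_{12} & \cdots & \gamma_{1K} \\ \gamma_{21} & \gamma_{22} & \cdots & \gamma_{2K} \\ \vdots & \vdots & & \vdots \\ \gamma_{G1} & \gamma_{G2} & \cdots & \gamma_{GK} \end{bmatrix}$$

$$y_t = \begin{bmatrix} y_{1t} \\ y_{2t} \\ \vdots \\ y_{Gt} \end{bmatrix} \qquad x_t = \begin{bmatrix} x_{1t} \\ x_{2t} \\ \vdots \\ x_{Kt} \end{bmatrix} \qquad u_t = \begin{bmatrix} u_{1t} \\ u_{2t} \\ \vdots \\ u_{Gt} \end{bmatrix}$$

† Notice that for the moment we have not normalized the structural equations by setting any of the $\beta$ coefficients at unity.

It is plausible to assume that the B matrix is nonsingular since, if it were not, one or more of the structural relations would merely be a linear combination of other structural relations, thus being redundant, or, if the rows of the $\Gamma$ matrix did not obey the same linear restrictions as the rows of B, the G structural equations would be inconsistent. Assuming, therefore, that $B^{-1}$ exists, the reduced form of the model is

$$y_t = \Pi x_t + v_t \qquad t = 1, \ldots, n \tag{11-18}$$

where

$$\Pi = -B^{-1}\Gamma \qquad\text{and}\qquad v_t = B^{-1}u_t \tag{11-19}$$

The $\Pi$ matrix is of order $G \times K$ and thus contains GK elements. The B and $\Gamma$ matrices contain at most $G^2 + GK$ elements. There is thus an infinity of B and $\Gamma$ structures corresponding to any given $\Pi$ matrix.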

The identification problem arises because the most that can be determined from observational data on $y_t$ and $x_t$ ($t = 1, \ldots, n$) is a knowledge of the elements of $\Pi$ and the elements of the variance-covariance matrix of the v's. This may be seen in a number of ways. The reduced form Eqs. (11-18) show explicitly that the model provides an explanation of $y_t$ conditional on $x_t$ and on the disturbance vector $v_t$. From Eqs. (11-19) it is clear that the stochastic properties of $v_t$ depend on the assumed stochastic properties of the structural disturbance vector $u_t$. Assuming $E(u_t) = 0$ for all t then gives†

$$E(y_t \mid x_t) = \Pi x_t$$

Thus the mean of the conditional distribution of $y_t$, given $x_t$, depends solely on the $\Pi$ matrix. A finite sample of observations ($y_t$, $x_t$; $t = 1, \ldots, n$) will yield some estimate $\hat{\Pi}$, which will deviate from the true $\Pi$ due to the fluctuations of random sampling. Suppose, however, that we dispense with sampling problems by assuming that an infinitely large sample of observations can be made available. In general the true $\Pi$ may then be determined with any desired degree of precision. This is all that can be afforded by the sample data. Thus knowledge of B and $\Gamma$ can only come from knowledge of $\Pi$.
To see the same point in a likelihood context, let us assume

$$u_t \sim N(0, \Sigma)$$

and also that the $u_t$ vectors are serially independent. It then follows from Eqs. (11-19) that

$$v_t \sim N(0, \Omega)$$

where

$$\Omega = B^{-1}\Sigma(B^{-1})' \tag{11-20}$$

and the $v_t$ are serially independent. From the reduced-form equation (11-18)

$$p(y_t \mid x_t) = p(v_t) = (2\pi)^{-G/2}|\Omega|^{-1/2}\exp\left(-\tfrac{1}{2}v_t'\Omega^{-1}v_t\right)$$

† When $x_t$ contains lagged y values, this expectation has to be read as conditional on these lagged endogenous values.

Thus the likelihood of the sample y's conditional on the x's is

$$\begin{aligned} L = p(y_1, y_2, \ldots, y_n \mid X) &= (2\pi)^{-nG/2}|\Omega|^{-n/2}\exp\left(-\tfrac{1}{2}\sum_{t=1}^{n}v_t'\Omega^{-1}v_t\right) \\ &= (2\pi)^{-nG/2}|\Omega|^{-n/2}\exp\left[-\tfrac{1}{2}\sum_{t=1}^{n}(y_t - \Pi x_t)'\Omega^{-1}(y_t - \Pi x_t)\right] \end{aligned} \tag{11-21}$$

Alternatively one might set up the likelihood in terms of the structural equations (11-17). This gives

$$p(y_t \mid x_t) = p(u_t)\left|\frac{\partial u_t}{\partial y_t}\right| = p(u_t)\cdot\|B\|$$

where $\|B\|$ denotes the absolute value of the determinant of B. The likelihood of the sample y's conditional on the x's is then

$$\begin{aligned} L &= (2\pi)^{-nG/2}\|B\|^{n}|\Sigma|^{-n/2}\exp\left(-\tfrac{1}{2}\sum_{t=1}^{n}u_t'\Sigma^{-1}u_t\right) \\ &= (2\pi)^{-nG/2}\|B\|^{n}|\Sigma|^{-n/2}\exp\left[-\tfrac{1}{2}\sum_{t=1}^{n}(By_t + \Gamma x_t)'\Sigma^{-1}(By_t + \Gamma x_t)\right] \end{aligned} \tag{11-22}$$

Comparing Eqs. (11-21) and (11-22) it is easily seen, using Eq. (11-20), that

$$(y_t - \Pi x_t)'\Omega^{-1}(y_t - \Pi x_t) = (By_t + \Gamma x_t)'\Sigma^{-1}(By_t + \Gamma x_t)$$

and

$$|\Omega|^{-n/2} = \|B\|^{n}|\Sigma|^{-n/2}$$

so that Eqs. (11-21) and (11-22) are equivalent. Leaving aside the variance matrices $\Sigma$ and $\Omega$, each of which contains $G(G + 1)/2$ parameters, there are $G^2 + GK$ parameters in Eq. (11-22) and just GK in Eq. (11-21). The likelihood function is thus completely specified by the GK parameters in $\Pi$. Identification of structural parameters in B and $\Gamma$ thus depends on the addition of further information to the model specified in Eq. (11-16). Such information usually takes the form of restrictions on various elements of B and $\Gamma$ and, less frequently, on the elements of $\Sigma$.
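The equivalence of the reduced-form and structural likelihoods can be verified numerically for a single observation. The following sketch uses a two-equation system with hypothetical values for B, $\Gamma$, $\Sigma$, $y_t$, and $x_t$ (all assumed for illustration) and small hand-written 2 x 2 matrix helpers, so that it needs only the standard library.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(r) for r in zip(*a)]


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]


def quad(v, m):
    """Quadratic form v' m^{-1} v for a 2 x 1 column vector v."""
    return matmul(matmul(transpose(v), inv2(m)), v)[0][0]


# Hypothetical structure and data point (illustrative values only).
B = [[1.0, 2.0], [-3.0, 1.0]]
Gamma = [[-10.0, -1.0], [2.0, 0.5]]
Sigma = [[1.0, 0.5], [0.5, 1.0]]
y = [[1.5], [3.0]]
x = [[1.0], [4.0]]
G = 2

# Structural form: u = B y + Gamma x, density p(u) * ||B||.
u = [[a[0] + b[0]] for a, b in zip(matmul(B, y), matmul(Gamma, x))]
log_struct = (-G / 2 * math.log(2 * math.pi) + math.log(abs(det2(B)))
              - 0.5 * math.log(det2(Sigma)) - 0.5 * quad(u, Sigma))

# Reduced form: Pi = -B^{-1} Gamma, Omega = B^{-1} Sigma B^{-1}', v = y - Pi x.
Binv = inv2(B)
Pi = [[-e for e in row] for row in matmul(Binv, Gamma)]
Omega = matmul(matmul(Binv, Sigma), transpose(Binv))
v = [[a[0] - b[0]] for a, b in zip(y, matmul(Pi, x))]
log_red = (-G / 2 * math.log(2 * math.pi)
           - 0.5 * math.log(det2(Omega)) - 0.5 * quad(v, Omega))

print(log_struct, log_red)
```

The two log densities coincide up to rounding error, which is the single-observation version of the equivalence between Eqs. (11-21) and (11-22).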


Restrictions on the Structural Coefficients 


We will consider the identification of the first equation in the system. The 
methods derived can then be applied to any structural equation. Let us rewrite the 
structural form of the model (11-17) as 


Az, = [B rl] =u (11-23) 


where A =[B_ I] is the G X (G + K) matrix of all structural coefficients and z, 
is a (G+ K) X11 vector of observations on all variables at time t. The first 
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structural equation may then be written as 
OZ, = Uy, 


where a, denotes the first row of A. 

Economic theory typically places restrictions on the elements of a,. The most 
common restrictions are exclusion restrictions, which specify that certain variables 
do not appear in certain equations. Suppose, for example, that y, does not appear 
in the first equation. The appropriate restriction is then 


Bi; =9 
which may be expressed as a linear restriction on the elements of a,, namely, 
0 
0 
1 
[Bu Bi Born Ma nxl}o}| =9 
0 


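There may also be linear homogeneous restrictions involving two or more
elements of $\alpha_1$. The specification that, say, the coefficients of $y_1$ and $y_2$ are equal
would be expressed as

$$[\beta_{11}\;\; \beta_{12}\;\; \beta_{13}\;\; \cdots\;\; \gamma_{11}\;\; \cdots\;\; \gamma_{1K}]\begin{bmatrix} 1 \\ -1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = 0$$

If these were the only a priori restrictions on $\alpha_1$, they may be expressed in the
form

$$\alpha_1\Phi = 0$$  (11-24)

where

$$\Phi = \begin{bmatrix} 0 & 1 \\ 0 & -1 \\ 1 & 0 \\ 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \end{bmatrix}$$

The $\Phi$ matrix has G + K rows and a column for each a priori restriction on the
first equation.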
In addition to the restrictions embodied in Eq. (11-24) there will also be
restrictions on $\alpha_1$ arising from the relations between structural and reduced-form
coefficients. From Eqs. (11-19) we may write

$$B\Pi + \Gamma = 0$$

or

$$AW = 0 \qquad \text{where} \qquad W = \begin{bmatrix} \Pi \\ I_K \end{bmatrix}$$

The restrictions on the coefficients of the first structural equation are thus

$$\alpha_1 W = 0$$  (11-25)

Combining Eqs. (11-24) and (11-25) gives

$$\alpha_1[W \;\; \Phi] = 0$$  (11-26)


There are G + K unknowns in $\alpha_1$. The matrix $[W \;\; \Phi]$ is of order (G + K) × (K + R),
where R is the number of columns in $\Phi$. On the assumption that $\Pi$ is
known all the elements in $[W \;\; \Phi]$ are known. Thus Eq. (11-26) constitutes a set
of K + R equations in G + K unknowns. Identification of the first equation
requires that the rank of $[W \;\; \Phi]$ be G + K − 1, for then all solutions to Eq.
(11-26) would lie on a single ray through the origin. This suffices to determine the
coefficients of the first equation uniquely, for in specifying the general model in
Eq. (11-17) a $\beta$ or $\gamma$ coefficient was attached to each variable in every equation.
Normalizing the first equation by setting one coefficient at unity (say, $\beta_{11} = 1$)
will now give a single point on the solution ray, and this determines $\alpha_1$ uniquely.
Thus

$$\rho[W \;\; \Phi] = G + K - 1$$  (11-27)

is clearly a necessary and sufficient condition for the identifiability of the first
equation. The condition for the identification of the ith structural equation is

$$\rho[W \;\; \Phi_i] = G + K - 1$$

where $\Phi_i$ is the matrix embodying the a priori restrictions on the ith equation. The
basic difficulty with the rank condition, as stated in Eq. (11-27), is that it is not a
convenient one to apply since it requires the construction of the $\Pi$ matrix, which
is complicated even in small models. We will give below an equivalent condition
in terms of structural parameters which is easier to apply. However, condition
(11-27) does yield necessary conditions for identification which are very simple to
apply. Since $[W \;\; \Phi]$ has K + R columns, a necessary condition for Eq. (11-27) to
hold is that

$$K + R \ge G + K - 1$$

or

$$R \ge G - 1$$  (11-28)
that is, 


The number of a priori restrictions should not be less than the number of 
equations in the model less 1. 


When the restrictions are solely exclusion restrictions, the necessary condition is 
restated as: 


The number of variables excluded from the equation must be at least as great as 
the number of equations in the model less 1. 
Finally, an alternative form of this last condition may be derived by letting

g = number of current endogenous variables included in equation

k = number of predetermined variables included in equation

Then

$$R = (G - g) + (K - k)$$

and the necessary condition becomes

$$(G - g) + (K - k) \ge G - 1$$

or

$$K - k \ge g - 1$$

that is,


The number of predetermined variables excluded from the equation must be at
least as great as the number of endogenous variables included less 1.


The necessary condition is referred to as the order condition for identifiability.
In large models this is often the only condition that can be applied since
application of the rank condition becomes difficult, if not impossible.
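
The counting rule above is easy to mechanize. The following is a minimal sketch, not part of the original text; the function name `order_condition` and the three status labels are illustrative assumptions. It only checks the necessary (order) condition K − k ≥ g − 1 and says nothing about the rank condition.

```python
def order_condition(K, g, k):
    """Classify an equation by the order condition K - k >= g - 1.

    K: number of predetermined variables in the whole model
    g: number of current endogenous variables included in the equation
    k: number of predetermined variables included in the equation
    """
    excluded = K - k      # predetermined variables excluded from the equation
    required = g - 1      # included endogenous variables less 1
    if excluded < required:
        return "fails"
    if excluded == required:
        return "holds with equality"
    return "holds with excess"


# Hypothetical two-equation cases (G = 2, K = 2):
status_a = order_condition(K=2, g=2, k=1)   # one x excluded
status_b = order_condition(K=2, g=2, k=0)   # both x's excluded
status_c = order_condition(K=2, g=2, k=2)   # no x excluded
```

A result of "fails" rules identification out; the other two results leave the rank condition still to be checked.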

The rank condition (11-27) may be restated as†

$$\rho[W \;\; \Phi] = G + K - 1 \quad \text{if and only if} \quad \rho(A\Phi) = G - 1$$  (11-29)


Note carefully that $[W \;\; \Phi]$ is a matrix consisting of the two indicated submatrices,
while $A\Phi$ is the product of two matrices. The second form of this
condition only involves the structural coefficients and thus affords an easier
application. When the restrictions are all exclusion restrictions, the first row of
$A\Phi$ is a zero vector and the remaining G − 1 rows consist of the coefficients in the
other structural equations of the variables which do not appear in the first
equation.
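
As a hedged illustration of how the structural form of the rank condition might be checked numerically, the sketch below computes ρ(AΦ) with NumPy and compares it with G − 1. The helper name `rank_condition` and all numerical coefficient values are assumptions made for the example, not values from the text.

```python
import numpy as np


def rank_condition(A, Phi):
    """Return (rank of A @ Phi, whether that rank equals G - 1)."""
    A = np.asarray(A, dtype=float)
    Phi = np.asarray(Phi, dtype=float)
    G = A.shape[0]                       # number of structural equations
    r = int(np.linalg.matrix_rank(A @ Phi))
    return r, r == G - 1


# Hypothetical structure A = [B  Gamma] with G = 2, K = 2.
# x_2 is excluded from the first equation, x_1 from the second.
A = np.array([[1.0, 0.5, 2.0, 0.0],
              [0.3, 1.0, 0.0, 1.5]])

# One exclusion restriction on the first equation: the coefficient of x_2 is zero.
Phi_1 = np.array([[0.0], [0.0], [0.0], [1.0]])

rank_1, identified_1 = rank_condition(A, Phi_1)
```

Because the rank is computed from specific numbers, a "freakish conjunction of coefficients" would show up as a rank deficiency for those particular values.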

If equality holds in Eq. (11-28), that is, R = G − 1, so that the number of
restrictions on the first equation is just equal to the number of structural
equations less 1, the matrix $A\Phi$ is then of order G × (G − 1). However, the first
row of this matrix is zero by virtue of $\alpha_1\Phi = 0$. This leaves a square matrix of
order G − 1 which, apart from some freakish conjunction of coefficients, will be
nonsingular. The first equation is then said to be exactly identified or just
identified. Suppose instead that R > G − 1. Then $A\Phi$ has G or more columns.
There are now more restrictions than strictly required for identification, and in
general there will be more than one square submatrix of order G − 1 to satisfy
the rank condition. The equation is then said to be overidentified.

A direct proof of the rank condition in terms of the A® matrix may be 
obtained from an alternative approach to the identification problem. We saw in 
one of the examples how taking linear combinations of the equations in a given 


† See F. M. Fisher, The Identification Problem in Econometrics, McGraw-Hill, New York, 1966,
Chap. 2; or for a shorter proof, R. W. Farebrother, "A Short Proof of the Basic Lemma of the Linear
Identification Problem," International Economic Review, vol. 12, 1971, pp. 515-516.
structure could yield a new structure which satisfied the same a priori constraints
as the original structure and had identical reduced-form coefficients. Let

$$A = [B \;\; \Gamma]$$

denote an original set of structural coefficients (that is, with specific numerical
values), and let FA denote a new structure obtained from A by premultiplication
with an arbitrary G × G nonsingular transformation matrix F. The new structure
is said to be admissible, or equivalently F is said to be an admissible transformation
matrix, if FA satisfies all a priori restrictions on A.‡ Identifiability of the first
equation then requires that the first equation of every admissible structure be
some scalar multiple of the true first equation. The first row of A may be
expressed as

$$\alpha_1 = e_1 A$$

where $e_1$ is a 1 × G row vector with unity in the first position and zero elsewhere.
Thus the a priori restrictions on the first equation may be written

$$e_1(A\Phi) = 0$$

The first row of coefficients in the transformed structure may be written as $f_1A$,
where $f_1$ denotes the first row of F. For an admissible structure this must obey the
same restrictions as $\alpha_1$, and so we must have

$$f_1(A\Phi) = 0$$

Identifiability requires that $f_1A$ be a scalar multiple of $e_1A$, that is, that $f_1$ be a
scalar multiple of $e_1$. The row vectors f satisfying $f(A\Phi) = 0$ form a space of
dimension $G - \rho(A\Phi)$ which always contains $e_1$, so every solution is a scalar
multiple of $e_1$ exactly when that dimension is 1. This gives the condition that
$\rho(A\Phi) = G - 1$. If all the equations of a model are identified, the only admissible
transformation matrices are diagonal matrices.
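
The admissibility argument can be illustrated numerically by computing every row vector f with f(AΦ) = 0. The sketch below is an illustration under assumed coefficient values; the helper name `admissible_first_rows` is hypothetical. It uses the singular value decomposition to obtain a basis for those vectors: one basis vector proportional to e₁ means the equation is identified, and more than one means it is not.

```python
import numpy as np


def admissible_first_rows(A, Phi, tol=1e-10):
    """Basis (as rows) of all row vectors f satisfying f @ (A @ Phi) = 0."""
    M = np.asarray(A, dtype=float) @ np.asarray(Phi, dtype=float)   # G x R
    # f M = 0 is the same as M' f' = 0, so take the null space of M'.
    _, s, Vt = np.linalg.svd(M.T)
    rank = int(np.sum(s > tol))
    return Vt[rank:]          # rows beyond the rank span the null space


# Hypothetical structure with x_2 absent from both equations
# and x_1 absent from the first equation.
A = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.3, 1.0, 2.0, 0.0]])

# First equation: exclusion of x_1 and x_2.
Phi_1 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# Second equation: exclusion of x_2 only.
Phi_2 = np.array([[0.0], [0.0], [0.0], [1.0]])

basis_1 = admissible_first_rows(A, Phi_1)   # one vector, proportional to e_1
basis_2 = admissible_first_rows(A, Phi_2)   # two vectors: any row is admissible
```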


Examples. To illustrate the application of the conditions for identifiability we 
shall work with the two-equation system 


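$$\beta_{11}y_{1t} + \beta_{12}y_{2t} + \gamma_{11}x_{1t} + \gamma_{12}x_{2t} = u_{1t}$$
$$\beta_{21}y_{1t} + \beta_{22}y_{2t} + \gamma_{21}x_{1t} + \gamma_{22}x_{2t} = u_{2t}$$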


As it stands, both equations are unidentifiable since no a priori restrictions have 
yet been imposed. Each example will postulate a different set of restrictions. 


Example 11-1 Suppose the a priori restrictions are 
$$\gamma_{12} = 0 \qquad \gamma_{21} = 0$$
For the first equation $\Phi$ is then a four-element column vector


‡ The general definition of admissibility also requires that the variance matrix of the transformed
disturbances satisfy all the a priori restrictions on the original variance matrix, but we are restricting
consideration here to the structural coefficients.
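$$\Phi = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \qquad \text{and} \qquad A\Phi = \begin{bmatrix} \gamma_{12} \\ \gamma_{22} \end{bmatrix} = \begin{bmatrix} 0 \\ \gamma_{22} \end{bmatrix}$$


Thus $\rho(A\Phi) = 1 = G - 1$, and the first equation is identified, provided, of
course, that $\gamma_{22} \ne 0$. If $\gamma_{22}$ were zero, the variable $x_2$ would not appear in
either equation, and so the fact that it was absent from the first would be of
no help in identifying that equation. In a similar fashion, the restriction on
the second equation gives

$$\Phi = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \qquad A\Phi = \begin{bmatrix} \gamma_{11} \\ \gamma_{21} \end{bmatrix} = \begin{bmatrix} \gamma_{11} \\ 0 \end{bmatrix}$$

and $\rho(A\Phi) = 1 = G - 1$.
Alternatively the equations

$$\alpha_1[W \;\; \Phi] = 0$$

in the parameters of the first equation give

$$[\beta_{11}\;\; \beta_{12}\;\; \gamma_{11}\;\; \gamma_{12}]\begin{bmatrix} \pi_{11} & \pi_{12} & 0 \\ \pi_{21} & \pi_{22} & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} = [0\;\; 0\;\; 0]$$

that is,

$$\beta_{11}\pi_{11} + \beta_{12}\pi_{21} + \gamma_{11} = 0$$
$$\beta_{11}\pi_{12} + \beta_{12}\pi_{22} + \gamma_{12} = 0$$
$$\gamma_{12} = 0$$

If we normalize by setting, say, $\beta_{11} = 1$, these give

$$\beta_{12} = -\frac{\pi_{12}}{\pi_{22}} \qquad \text{and} \qquad \gamma_{11} = \frac{\pi_{12}\pi_{21} - \pi_{11}\pi_{22}}{\pi_{22}}$$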


which shows explicitly how the parameters of the first equation may be 
derived uniquely from those of the reduced form. The parameters of the 
second equation may be obtained in a similar fashion. 


Example 11-2 The restrictions are 
$$\gamma_{12} = 0 \qquad \gamma_{22} = 0$$


For the first equation

$$\Phi = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \qquad \text{and} \qquad A\Phi = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$


which has zero rank. Thus the first equation is not identifiable; nor is the
second, for this is the case we alluded to in Example 11-1, where $x_2$ appears
in neither equation.


Example 11-3 The restrictions are 
$$\gamma_{11} = \gamma_{12} = 0 \qquad \gamma_{22} = 0$$


This example might be treated in two ways. In one approach we note that the
restrictions $\gamma_{12} = 0 = \gamma_{22}$ mean that $x_2$ does not appear in the model at all.
Thus the model could be reduced to one with just a single exogenous variable,
in which case the only restriction is $\gamma_{11} = 0$, and that suffices to identify the
first equation, but leaves the second unidentified. Alternatively, retaining the
dimensions of the original model, the restrictions on the first equation give

$$\Phi = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \text{with} \qquad A\Phi = \begin{bmatrix} 0 & 0 \\ \gamma_{21} & 0 \end{bmatrix}$$


Thus $\rho(A\Phi) = 1 = G - 1$, and so the first equation is identified. For the
second equation

$$\Phi = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \qquad \text{with} \qquad A\Phi = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

so that this equation is not identified. Alternatively, for the second equation

$$\alpha_2[W \;\; \Phi] = 0$$

gives

$$[\beta_{21}\;\; \beta_{22}\;\; \gamma_{21}\;\; \gamma_{22}]\begin{bmatrix} \pi_{11} & \pi_{12} & 0 \\ \pi_{21} & \pi_{22} & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} = [0\;\; 0\;\; 0]$$


This appears to give three equations in four unknowns. Setting $\beta_{22} = 1$ would
then determine the remaining parameters of the second equation. However,
the restrictions $\gamma_{12} = 0 = \gamma_{22}$ imply $\pi_{12} = 0 = \pi_{22}$. Thus the second and third
columns in $[W \;\; \Phi]$ are identical, and so we only have two equations plus a
normalization rule, which are insufficient to identify the second equation.


Example 11-4 The restrictions are 


$$\gamma_{11} = 0 \qquad \gamma_{12} = 0$$

For the first equation

$$\Phi = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \text{and} \qquad A\Phi = \begin{bmatrix} 0 & 0 \\ \gamma_{21} & \gamma_{22} \end{bmatrix}$$

so $\rho(A\Phi) = 1$ and the first equation is identified, while the second is not.

$$\alpha_1[W \;\; \Phi] = 0$$

gives

$$\beta_{11}\pi_{11} + \beta_{12}\pi_{21} + \gamma_{11} = 0$$
$$\beta_{11}\pi_{12} + \beta_{12}\pi_{22} + \gamma_{12} = 0$$
$$\gamma_{11} = 0$$
$$\gamma_{12} = 0$$

which, on setting $\beta_{11} = 1$, gives

$$\beta_{12} = -\frac{\pi_{11}}{\pi_{21}} = -\frac{\pi_{12}}{\pi_{22}}$$


This does not imply a contradiction, for both expressions for $\beta_{12}$ will yield an
identical value. The prior specifications and the normalization rule in this
example give the model

$$y_{1t} + \beta_{12}y_{2t} = u_{1t}$$
$$\beta_{21}y_{1t} + y_{2t} + \gamma_{21}x_{1t} + \gamma_{22}x_{2t} = u_{2t}$$

The matrix of reduced-form coefficients is

$$\Pi = \begin{bmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{bmatrix} = \frac{1}{\Delta}\begin{bmatrix} \beta_{12}\gamma_{21} & \beta_{12}\gamma_{22} \\ -\gamma_{21} & -\gamma_{22} \end{bmatrix}$$


where $\Delta = 1 - \beta_{12}\beta_{21}$. Although $\Pi$ is a 2 × 2 matrix, its rank is only 1. This
is an example of overidentification. Only one prior restriction is needed to
identify the first equation, but we have two. The consequence is a restriction
on the reduced-form coefficients. Notice also that even in the overidentified
case $\rho(A\Phi)$ cannot exceed G − 1. $A\Phi$ has G rows, but the first row is always
zero for homogeneous restrictions, so $\rho(A\Phi) \le G - 1$ even in cases of
overidentification where $A\Phi$ has G or more columns. If $\Pi$ is replaced in an
actual two-equation problem by $\hat{\Pi}$, the matrix of estimated reduced-form
coefficients, then $\rho(\hat{\Pi})$ will almost certainly be 2 and not 1, so that estimating
$\beta_{12}$ by $-\hat{\pi}_{11}/\hat{\pi}_{21}$ or by $-\hat{\pi}_{12}/\hat{\pi}_{22}$ would yield two different values. ILS is thus
not a suitable estimation method for overidentified equations, since it fails to
yield unique estimates.
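
The point about overidentification can be seen in a small numerical sketch. All coefficient values below are hypothetical, and the fixed perturbation added to Π is merely a stand-in for sampling error in an estimated reduced form; it is not an estimate from data.

```python
import numpy as np

# Hypothetical structural coefficients for the model of Example 11-4.
beta12, beta21, gamma21, gamma22 = 0.5, 0.3, 2.0, 1.5
B = np.array([[1.0, beta12], [beta21, 1.0]])
Gamma = np.array([[0.0, 0.0], [gamma21, gamma22]])

Pi = -np.linalg.solve(B, Gamma)      # Pi = -B^{-1} Gamma


def ils_ratios(P):
    """The two ILS expressions for beta_12: -pi11/pi21 and -pi12/pi22."""
    return -P[0, 0] / P[1, 0], -P[0, 1] / P[1, 1]


rank_true = np.linalg.matrix_rank(Pi, tol=1e-10)
r1_true, r2_true = ils_ratios(Pi)    # both equal beta12

# A made-up perturbation standing in for sampling error in the estimated Pi.
Pi_hat = Pi + np.array([[0.05, -0.02], [0.01, 0.03]])
rank_hat = np.linalg.matrix_rank(Pi_hat, tol=1e-10)
r1_hat, r2_hat = ils_ratios(Pi_hat)  # no longer equal
```

With the exact Π both ratios recover the same coefficient; once Π is perturbed its rank rises to 2 and the two ratios disagree.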


Example 11-5 The restrictions are 
$$\gamma_{11} = 0 \qquad \gamma_{12} = 0 \qquad \beta_{21} + \gamma_{21} = 0 \qquad \gamma_{22} = 0$$

This is Example 11-3 with the additional specification $\beta_{21} + \gamma_{21} = 0$. In
Example 11-3 the first equation was identifiable and the second not. Leaving
$x_2$ out of the model, we now have for the second equation

$$\Phi = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \qquad \text{and} \qquad A\Phi = \begin{bmatrix} \beta_{11} \\ 0 \end{bmatrix}$$


so $\rho(A\Phi) = 1$ and the second equation is now identified.


In all the above examples readers should check for themselves that the
necessary condition (or order condition, as it is often called) would correctly
indicate the presence or absence of identification. This need not always be the
case. For example, if $\beta_{11}$ in Example 11-5 were zero, the rank condition would fail
even though there is one restriction on the second equation.


Treatment of Identities 


Identities themselves do not raise any identification problems since in general the 
coefficients are known and indeed are usually unity. The general model 


$$By_t + \Gamma x_t = u_t$$


may, however, be formulated in two alternative fashions. In one version all 

identities appear explicitly in the model. In the alternative version the identities 

may be substituted in other structural equations, thus effectively reducing the size 

of the model. The identification rules may be applied to either version. Solving 

out the identities will not change any conclusions about the identifiability of any 

behavioral or other structural equation whether in its original or revised form. 
As an illustration consider the simple supply-and-demand model 


$$q^D = \alpha_0 + \alpha_1 p + u_1$$
$$q^S = \beta_0 + \beta_1 p + \beta_2 w + u_2$$
$$q^D = q^S$$

where $q^D$ = quantity demanded
$q^S$ = quantity supplied
p = price
w = an index of weather conditions


This is a model containing three endogenous variables $q^D$, $q^S$, and p (G = 3) and
two exogenous variables, w and z (a dummy variable set at unity to take care of
the intercept term in the first two equations). Rearranging the model in more




suitable form we have

$$\begin{bmatrix} 1 & 0 & -\alpha_1 & 0 & -\alpha_0 \\ 0 & 1 & -\beta_1 & -\beta_2 & -\beta_0 \\ 1 & -1 & 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} q^D \\ q^S \\ p \\ w \\ z \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ 0 \end{bmatrix}$$

For the first equation

$$A\Phi = \begin{bmatrix} 0 & 0 \\ 1 & -\beta_2 \\ -1 & 0 \end{bmatrix}$$

and $\rho(A\Phi) = 2 = G - 1$ so that the equation is identified. Notice that when we
have exclusion restrictions, the $A\Phi$ matrix can be written down directly by taking
the columns of the A matrix which contain zeros in the row corresponding to the
equation under study. For the second equation, whose only excluded variable is $q^D$,

$$A\Phi = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}$$

which only has rank unity, and so the second equation is not identified.
If we rewrite the model without the identity, it becomes a two-equation model
in two endogenous variables q and p,

$$q = \alpha_0 + \alpha_1 p + u_1$$
$$q = \beta_0 + \beta_1 p + \beta_2 w + u_2$$

where now G = 2, and the first equation is again just identified because it has one
restriction on its coefficients while the second equation is not identified because
there are no restrictions on its coefficients.
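
The shortcut for exclusion restrictions (take the columns of A that have a zero in the row under study) lends itself to a short check. The sketch below applies it to the three-equation version of the supply-and-demand model; the parameter values and the helper name `a_phi_exclusions` are assumptions for illustration, and the helper treats every numerical zero in the chosen row as an exclusion restriction.

```python
import numpy as np


def a_phi_exclusions(A, i):
    """A @ Phi for equation i when every zero in row i is an exclusion restriction."""
    A = np.asarray(A, dtype=float)
    excluded = np.flatnonzero(A[i] == 0.0)   # columns of variables absent from equation i
    return A[:, excluded]


# Hypothetical parameter values; columns are ordered (qD, qS, p, w, z).
a0, a1, b0, b1, b2 = 10.0, -0.8, 2.0, 0.6, 0.4
A = np.array([[1.0,  0.0, -a1,  0.0, -a0],
              [0.0,  1.0, -b1, -b2, -b0],
              [1.0, -1.0, 0.0,  0.0, 0.0]])
G = A.shape[0]

rank_demand = np.linalg.matrix_rank(a_phi_exclusions(A, 0))   # columns qS and w
rank_supply = np.linalg.matrix_rank(a_phi_exclusions(A, 1))   # column qD only
```

The demand equation reaches rank G − 1 = 2, while the supply equation reaches only rank 1, matching the conclusion in the text.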


Inhomogeneous Linear Restrictions 

The linear restrictions embodied in Eq. (11-24) are all homogeneous, that is,
specified coefficients or linear combinations of coefficients are set equal to zero.
Many restrictions indicated by economic theory occur naturally in a nonhomogeneous
form, an illustration being the specification, say, that the elasticities in a
production function sum to unity. Such restrictions, however, have no meaning
until a normalization rule has been imposed. Thus if we have the restriction

$$\beta_{12} + \gamma_{11} = 1$$

it can be written as

$$\beta_{12} + \gamma_{11} - \beta_{11} = 0$$

plus the normalization rule $\beta_{11} = 1$. Thus inhomogeneous restrictions can be
recast in homogeneous form before normalization and the previous procedures
still apply.
Restrictions across Equations


So far we have only considered linear restrictions within a structural equation. 
There are cases, however, where theory suggests restrictions across equations, 
some examples of which have already been encountered in Sec. 8-6. These can 
also serve to ensure identifiability, as is shown in the following simplified 
examples. 


Example 11-6 Consider the model 

$$y_1 + \beta_{12}y_2 + \gamma_{11}x_1 = u_1$$
$$\beta_{21}y_1 + y_2 + \gamma_{21}x_1 = u_2$$

Without further restrictions neither equation is identified. The imposition of
cross-equation restrictions requires that each equation be normalized, otherwise
the restriction is ambiguous. Suppose there is a theoretical basis for
postulating

$$\gamma_{11} + \gamma_{21} = 0$$

Identifiability in the presence of the restriction may be examined either by
looking at the relationship between structural and reduced-form parameters
or by investigating the set of admissible transformed structures that satisfy
the restriction. The reduced-form equations are

$$y_1 = -\frac{\gamma_{11}}{\Delta}(1 + \beta_{12})x_1 + v_1$$

$$y_2 = \frac{\gamma_{11}}{\Delta}(1 + \beta_{21})x_1 + v_2$$

where

$$\Delta = 1 - \beta_{12}\beta_{21}$$

The reduced form yields only two parameters and, even with the restriction,
there are still three structural parameters. It is clear that neither equation is
identified.†


Example 11-7 Consider 
$$y_1 + \gamma_{11}x_1 = u_1$$
$$\beta_{21}y_1 + y_2 + \gamma_{21}x_1 = u_2$$


† The argument to the contrary in G. S. Maddala, Econometrics, McGraw-Hill, New York, 1977, p.
230, is incorrect. Maddala investigates identifiability via transformation matrices. However, he
essentially postulates a transformation matrix

$$F = \begin{bmatrix} 1 & \lambda \\ 0 & 1 \end{bmatrix}$$

and then finds that the restriction implies $\lambda = 0$, which leads him to conclude that both equations are
identified. But F has already assumed that the second equation is identified, which is an invalid
assumption. The identifiability of both equations has to be considered jointly.
Postulating the transformation matrix

$$F = \begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix}$$

the transformed structure is

$$(f_{11} + f_{12}\beta_{21})y_1 + f_{12}y_2 + (f_{11}\gamma_{11} + f_{12}\gamma_{21})x_1 = u_1^{*}$$
$$(f_{21} + f_{22}\beta_{21})y_1 + f_{22}y_2 + (f_{21}\gamma_{11} + f_{22}\gamma_{21})x_1 = u_2^{*}$$

The requirement that the transformed structure satisfies the same a priori
constraints as the original structure, namely, that $y_2$ does not appear in the
first equation, gives

$$f_{12} = 0$$

The normalized transformed structure is then

$$y_1 + \gamma_{11}x_1 = u_1^{**}$$
$$\frac{f_{21} + f_{22}\beta_{21}}{f_{22}}\,y_1 + y_2 + \frac{f_{21}\gamma_{11} + f_{22}\gamma_{21}}{f_{22}}\,x_1 = u_2^{**}$$

If we now impose the cross-equation constraint $\gamma_{11} + \gamma_{21} = 0$ on the original
structure, the same condition on the transformed structure gives

$$\gamma_{11} + \frac{f_{21}\gamma_{11} + f_{22}\gamma_{21}}{f_{22}} = 0$$

or

$$f_{21}\gamma_{11} = 0$$

giving

$$f_{21} = 0$$

so that all admissible transformation matrices are diagonal and both equations
are identified.
Alternatively the reduced form of the model is

$$y_1 = -\gamma_{11}x_1 + v_1$$
$$y_2 = (\beta_{21}\gamma_{11} - \gamma_{21})x_1 + v_2 = \gamma_{11}(\beta_{21} + 1)x_1 + v_2$$

The parameter $\gamma_{11}$ can be obtained from the first reduced-form coefficient and
$\beta_{21}$ can be derived from the second reduced-form coefficient, so that both
equations are identified.


Restrictions on the Variance Matrix 


So far the only explicit assumption about the disturbances has been that of serial 
independence, but we have made no explicit assumptions about contemporaneous 
correlations between disturbances in different structural equations. Let 


$$\Sigma = E(u_t u_t')$$


$\Sigma$ is then a G × G matrix, the terms on the principal diagonal indicating the
variances (assumed constant) of the disturbances in the G structural equations
and the off-diagonal terms indicating the covariances between pairs of disturbances.
If specific restrictions can be placed on some of these elements, they
constitute an additional source of identifying power.

Let us examine first of all restrictions on covariances. Consider the model 


$$y_1 + \gamma_{11}x_1 = u_1$$
$$\beta_{21}y_1 + y_2 + \gamma_{21}x_1 = u_2$$

As is easily seen, the first equation of this model is identifiable and the second is
not. We shall, however, examine the identifiability of the model again by
considering admissible transformation matrices, as this approach facilitates the
study of restrictions on variances and covariances.
Using

$$F = \begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix}$$

the transformed first equation becomes

$$(f_{11} + f_{12}\beta_{21})y_1 + f_{12}y_2 + (f_{11}\gamma_{11} + f_{12}\gamma_{21})x_1 = f_{11}u_1 + f_{12}u_2$$

If the coefficients of the transformed equation are to obey the same restrictions as
those of the original equation, we must have

$$f_{11} + f_{12}\beta_{21} = 1$$
$$f_{12} = 0$$

giving $f_{11} = 1$ and $f_{12} = 0$. The only restriction on the second equation is the
normalization condition, which is held in abeyance. Thus admissible transformation
matrices are given by

$$F = \begin{bmatrix} 1 & 0 \\ f_{21} & f_{22} \end{bmatrix}$$

showing that the first equation is identified and the second not.
Suppose we can now postulate

$$\Sigma = \begin{bmatrix} \sigma_{11} & 0 \\ 0 & \sigma_{22} \end{bmatrix}$$

The vector of disturbances in the transformed structure is $Fu_t$, and so the
variance-covariance matrix for the disturbances of the transformed structure is

$$\Psi = E(Fu_t u_t'F') = F\Sigma F'$$

This must obey the restriction that the covariance between the two transformed
disturbances is zero, that is,

$$f_1\Sigma f_2' = [1\;\; 0]\begin{bmatrix} \sigma_{11} & 0 \\ 0 & \sigma_{22} \end{bmatrix}\begin{bmatrix} f_{21} \\ f_{22} \end{bmatrix} = 0$$

where $f_1$ and $f_2$ denote the rows of F, that is,

$$f_{21}\sigma_{11} = 0$$

which gives

$$f_{21} = 0$$

The value of $f_{22}$ is then settled by the normalization condition that the coefficient
of $y_2$ in the second equation must be unity. The coefficients of the transformed
structure are given by

$$FA = \begin{bmatrix} 1 & 0 \\ 0 & f_{22} \end{bmatrix}\begin{bmatrix} 1 & 0 & \gamma_{11} \\ \beta_{21} & 1 & \gamma_{21} \end{bmatrix}$$

giving the coefficient of $y_2$ in the second equation as $f_{22}$. Thus $f_{22} = 1$, and the
only admissible transformation matrix is

$$F = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

so that both equations are identified.†
As a further illustration consider the model 
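$$y_1 + \gamma_{11}x_1 = u_1$$
$$\beta_{21}y_1 + y_2 + \gamma_{21}x_1 = u_2$$
$$\beta_{31}y_1 + \beta_{32}y_2 + y_3 + \gamma_{31}x_1 = u_3$$

Without further restrictions only the first equation is identifiable. If, however, we
assume

$$\Sigma = \begin{bmatrix} \sigma_{11} & 0 & 0 \\ 0 & \sigma_{22} & 0 \\ 0 & 0 & \sigma_{33} \end{bmatrix}$$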


the second and third equations become identifiable. Consider

$$FA = \begin{bmatrix} 1 & 0 & 0 \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & \gamma_{11} \\ \beta_{21} & 1 & 0 & \gamma_{21} \\ \beta_{31} & \beta_{32} & 1 & \gamma_{31} \end{bmatrix}$$

The normalization condition on $y_2$ in the second equation and on $y_3$ in the third
give

$$f_{22} + f_{23}\beta_{32} = 1$$
$$f_{33} = 1$$

and the exclusion of $y_3$ from the second equation gives

$$f_{23} = 0$$

which also implies

$$f_{22} = 1$$

Thus F is now

$$F = \begin{bmatrix} 1 & 0 & 0 \\ f_{21} & 1 & 0 \\ f_{31} & f_{32} & 1 \end{bmatrix}$$


† It is convenient algebraically, but not necessary, to impose the normalization condition on the
coefficients of $y_1$ and $y_2$ in the first and second equations, respectively, of the transformed structure.
The absence of $y_2$ from the first equation gives $f_{12} = 0$. The zero covariance term then gives $f_{21} = 0$
and so the class of admissible transformation matrices is

$$F = \begin{bmatrix} f_{11} & 0 \\ 0 & f_{22} \end{bmatrix}$$

which secures the identification of both equations.


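We have not yet considered the effect of the zero covariance restrictions
$\sigma_{12} = \sigma_{13} = \sigma_{23} = 0$. These must be satisfied by the transformed structure. Hence

$$f_1\Sigma f_2' = 0$$
$$f_1\Sigma f_3' = 0$$
$$f_2\Sigma f_3' = 0$$

The first of these gives

$$[1\;\; 0\;\; 0]\begin{bmatrix} \sigma_{11} & 0 & 0 \\ 0 & \sigma_{22} & 0 \\ 0 & 0 & \sigma_{33} \end{bmatrix}\begin{bmatrix} f_{21} \\ 1 \\ 0 \end{bmatrix} = f_{21}\sigma_{11} = 0$$

so that

$$f_{21} = 0$$

and in a similar fashion the second and third conditions give $f_{31} = 0$ and $f_{32} = 0$.
Thus the only admissible transformation matrix is

$$F = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$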


and all three equations are identified. 

The above model has two special features, namely, a triangular B matrix and
a diagonal $\Sigma$ matrix. The presence of these two features defines a recursive system.
All the equations of the recursive system are identified and, as we shall see below,
simple estimation procedures are available for this model.

Zero covariances can aid identification and not necessarily just in recursive
systems. For example, in

$$y_1 + \beta_{12}y_2 = u_1$$
$$\beta_{21}y_1 + y_2 + \gamma_{21}x_1 = u_2$$

the first equation is identified and the second is not. However, the additional
specification $\sigma_{12} = 0$ would serve to identify the second equation as readers can
easily prove for themselves. There is no simple necessary and sufficient condition
for the zero covariance case as there was for restrictions on the $\beta$ and $\gamma$
parameters, so each case must be examined from first principles.
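
One way the proof left to the reader might run is sketched below, using the admissible-transformation argument of this section. It is an illustrative derivation rather than the author's own, and it assumes that each equation is normalized on its own endogenous variable, that $\gamma_{21} \ne 0$, and that $\sigma_{11} > 0$; $f_1$ and $f_2$ denote the rows of F.

```latex
\begin{aligned}
&\text{First row of } FA:\quad
  (f_{11}+f_{12}\beta_{21})\,y_1 + (f_{11}\beta_{12}+f_{12})\,y_2 + f_{12}\gamma_{21}\,x_1 \\
&\text{Exclusion of } x_1 \text{ from the first equation:}\quad
  f_{12}\gamma_{21}=0 \;\Rightarrow\; f_{12}=0,
  \qquad \text{normalization: } f_{11}=1 \\
&\text{Zero covariance:}\quad
  f_1\Sigma f_2' =
  \begin{bmatrix}1 & 0\end{bmatrix}
  \begin{bmatrix}\sigma_{11} & 0\\ 0 & \sigma_{22}\end{bmatrix}
  \begin{bmatrix}f_{21}\\ f_{22}\end{bmatrix}
  = f_{21}\sigma_{11}=0 \;\Rightarrow\; f_{21}=0 \\
&\text{Normalization of the second equation:}\quad
  f_{21}\beta_{12}+f_{22}=1 \;\Rightarrow\; f_{22}=1,
  \qquad\text{so } F=I
\end{aligned}
```

Under these assumptions the only admissible transformation is the identity, so both equations are identified.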
The discussion has dealt only with models which are linear in variables and
parameters. Many realistic models, however, may be nonlinear in variables
and/or a priori restrictions. Identification theory for such models is difficult and
has only been partially developed. Owing to the unsatisfactory state of the theory
it will not be summarized here. Interested readers should consult Fisher.†


11-3 ESTIMATION OF SIMULTANEOUS EQUATION MODELS 


Whether we wish to estimate an equation which is one of a set of equations 
constituting a complete model or whether we wish to estimate all the equations of 
a model we are in a situation where OLS and the variants of OLS that we have 
considered so far in the context of a single-equation model are, in general, 
unsatisfactory estimating techniques. If OLS is applied to an equation in a model, 
there will usually be more than one current endogenous variable in the relation, 
and whichever variable one selects as the “dependent” variable, the remaining 
endogenous variable(s) will generally be correlated with the disturbance term in 
the equation so that OLS estimates will be biased and inconsistent. Only in the 
case of recursive models will OLS be an optimal estimating technique. 

In the more general simultaneous case, where the special assumptions of a 
recursive system are not fulfilled, the main estimating techniques are indirect least 
squares (ILS), two-stage least squares (2SLS), both of which may be interpreted 
as IV estimators, limited-information maximum likelihood (LIML), three-stage 
least squares (3SLS), and full-information maximum likelihood (FIML). ILS, 
2SLS, and LIML are essentially single-equation methods, in which attention is 
focused on one equation at a time without using all the information contained in 
the detailed specification of the rest of the model. 3SLS and FIML are system 
methods, where all the equations of the fully specified structural model are 


estimated simultaneously. 


Recursive Systems 


As we have seen already, the two crucial features of a recursive system are a
triangular B matrix and a diagonal $\Sigma$ matrix. As an illustration consider the
model

$$y_{1t} + \gamma_{11}x_t = u_{1t}$$
$$\beta_{21}y_{1t} + y_{2t} + \gamma_{21}x_t = u_{2t}$$

with the specification

$$E(u_t u_t') = \Sigma = \begin{bmatrix} \sigma_{11} & 0 \\ 0 & \sigma_{22} \end{bmatrix}$$


† F. M. Fisher, The Identification Problem in Econometrics, McGraw-Hill, New York, 1966,
Chap. 5.
To explore the connection between the y's and the u's we look at the reduced-form
equations which are

$$y_{1t} = -\gamma_{11}x_t + u_{1t}$$
$$y_{2t} = (\beta_{21}\gamma_{11} - \gamma_{21})x_t + (u_{2t} - \beta_{21}u_{1t})$$

The first equation is the same in each case. Since the exogenous variable x is by
assumption uncorrelated with the u's, the first equation may be estimated
consistently by OLS. The second reduced-form equation shows $y_{2t}$ to be a
function of both $u_{1t}$ and $u_{2t}$. Thus it would be inappropriate to estimate the
second structural equation by an OLS regression of $y_1$ on $y_2$ and x. However, $y_{1t}$ is
uncorrelated with $u_{2t}$ since it is a function only of $u_{1t}$, which has zero correlation
with $u_{2t}$. Thus an OLS regression of $y_2$ on $y_1$ and x will yield consistent estimates
of the second structural equation.
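
A small simulation can make the role of the diagonal Σ concrete. The sketch below is illustrative only: the parameter values, sample size, seed, and the function name `simulate_ols` are all assumptions. It generates data from the two-equation recursive model, estimates the second structural equation by OLS, and repeats the exercise with correlated disturbances, where the recursive assumptions fail.

```python
import numpy as np


def simulate_ols(sigma12, n=200_000, seed=0):
    """OLS estimates of (beta21, gamma21) in the second structural equation.

    Model (hypothetical parameter values):
        y1 + gamma11 * x = u1
        beta21 * y1 + y2 + gamma21 * x = u2
    with var(u1) = var(u2) = 1 and cov(u1, u2) = sigma12.
    """
    rng = np.random.default_rng(seed)
    beta21, gamma11, gamma21 = 0.5, -1.0, 2.0
    x = rng.normal(size=n)
    cov = np.array([[1.0, sigma12], [sigma12, 1.0]])
    u = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y1 = -gamma11 * x + u[:, 0]
    y2 = -beta21 * y1 - gamma21 * x + u[:, 1]
    # Regress y2 on y1 and x; the slopes estimate -beta21 and -gamma21.
    Z = np.column_stack([y1, x])
    coef, *_ = np.linalg.lstsq(Z, y2, rcond=None)
    return -coef[0], -coef[1]


b_rec, g_rec = simulate_ols(sigma12=0.0)   # recursive case: diagonal Sigma
b_sim, g_sim = simulate_ols(sigma12=0.8)   # correlated disturbances
```

In the diagonal case the estimates settle near the assumed values 0.5 and 2.0; with correlated disturbances the estimate of β₂₁ is pulled well away from 0.5.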

More generally, the disturbance vector in the reduced form of a model is

$$v_t = B^{-1}u_t$$  (11-30)


When B is lower triangular, then so is $B^{-1}$. Thus Eq. (11-30) gives

$$y_{1t} = f(u_{1t})$$
$$y_{2t} = f(u_{1t}, u_{2t})$$
$$y_{3t} = f(u_{1t}, u_{2t}, u_{3t})$$
$$\vdots$$
$$y_{Gt} = f(u_{1t}, u_{2t}, \ldots, u_{Gt})$$


The assumption of a diagonal $\Sigma$ matrix then ensures that $y_{1t}$ is uncorrelated with
$u_{2t}$, that $y_{2t}$ is uncorrelated with $u_{3t}$, and so forth. Thus the second structural
equation may be estimated consistently by an OLS regression with $y_2$ as the
dependent variable, the third with $y_3$ as the dependent variable, and so on.

It is also easy to show that if the u's are normally distributed, OLS yields ML
estimates. As was shown in the previous section, the likelihood of the sample y's,
conditional on the x's, for the model

$$By_t + \Gamma x_t = u_t$$

is given by

$$L = (2\pi)^{-nG/2}\,\|B\|^{n}\,|\Sigma|^{-n/2}\exp\Bigl(-\tfrac{1}{2}\sum_{t=1}^{n}u_t'\Sigma^{-1}u_t\Bigr)$$

For recursive systems |B| is unity and $\Sigma$ and $\Sigma^{-1}$ are both diagonal. Thus finding
the B and $\Gamma$ to maximize L is equivalent to finding the B and $\Gamma$ to minimize

$$\sum_{t=1}^{n}u_t'\Sigma^{-1}u_t$$

For a three-equation system this sum of squares is

$$S = \sum_{t=1}^{n}[u_{1t}\;\; u_{2t}\;\; u_{3t}]\begin{bmatrix} 1/\sigma_{11} & 0 & 0 \\ 0 & 1/\sigma_{22} & 0 \\ 0 & 0 & 1/\sigma_{33} \end{bmatrix}\begin{bmatrix} u_{1t} \\ u_{2t} \\ u_{3t} \end{bmatrix} = \sum_{t=1}^{n}\Bigl(\frac{u_{1t}^2}{\sigma_{11}} + \frac{u_{2t}^2}{\sigma_{22}} + \frac{u_{3t}^2}{\sigma_{33}}\Bigr)$$

Thus the partial derivatives of ln L with respect to the coefficients of the ith
structural equation are, apart from the factor $-\tfrac{1}{2}$, simply the partial derivatives of

$$\sum_{t=1}^{n}\frac{u_{it}^2}{\sigma_{ii}}$$

Setting these partial derivatives to zero gives the OLS equations for the ith
structural equation. Thus under the special assumptions of the recursive model
the OLS estimators of the structural equations will have the desirable properties
of consistency, asymptotic normality, and efficiency. They will also have the usual
small sample properties.†


Indirect Least Squares 


As indicated in Sec. 11-1, ILS is a feasible estimation technique for an equation
which is just identified. The first step consists of estimating the matrix of
reduced-form coefficients by the application of OLS to each of the reduced-form
equations. The estimates of the structural coefficients are then obtained from the
algebraic relations existing between structural and reduced-form coefficients.
The structural model at time period t has been written as

$$By_t + \Gamma x_t = u_t$$  (11-31)

where

$$y_t = \begin{bmatrix} y_{1t} \\ y_{2t} \\ \vdots \\ y_{Gt} \end{bmatrix} \qquad \text{and} \qquad x_t = \begin{bmatrix} x_{1t} \\ x_{2t} \\ \vdots \\ x_{Kt} \end{bmatrix}$$

are, respectively, the G × 1 vector of observations on the jointly dependent
endogenous variables at time t and the K × 1 vector of observations on the
predetermined variables at time t.


† For a proof that the usual small sample inference procedures apply see E. Malinvaud, Statistical
Methods of Econometrics, 2nd edition, North-Holland, Amsterdam, 1970, pp. 679-681.


Let us define Y and X as

$$Y = \begin{bmatrix} y_1' \\ y_2' \\ \vdots \\ y_n' \end{bmatrix} \qquad X = \begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix}$$

so that Y is the n × G matrix of the sample observations on the endogenous
variables and X is the n × K matrix of sample observations on the predetermined
variables. From Eq. (11-31) we then have

$$YB' + X\Gamma' = U$$  (11-32)

where U is the n × G matrix of all the sample disturbances. The reduced form
may then be written

$$Y = X\Pi' + V$$  (11-33)

where

$$\Pi' = -\Gamma'(B')^{-1}$$  (11-34)

and

$$V = U(B')^{-1}$$

The matrix of reduced-form coefficients defined in Eq. (11-34) is simply the
transpose of the matrix previously defined in Eqs. (11-19). The estimation of $\Pi'$ is
accomplished by applying OLS to Eq. (11-33) giving

$$\hat{\Pi}' = (X'X)^{-1}X'Y$$  (11-35)


This yields the set of estimated reduced-form coefficients for the first stage of ILS. 
Let us denote the equation we are interested in estimating by 


$$y = Y_1\beta + X_1\gamma + u$$  (11-36)

where y = n × 1 vector of observations on the dependent (endogenous) variable in
the equation
$Y_1$ = n × (g − 1) matrix of observations on the other g − 1 current endogenous
variables in the equation
$X_1$ = n × k matrix of observations on the k predetermined variables in the
equation
u = n × 1 vector of disturbances in the equation.

Rewriting Eq. (11-36) gives

$$[y \;\; Y_1 \;\; X_1]\begin{bmatrix} 1 \\ -\beta \\ -\gamma \end{bmatrix} = u$$

or, more fully,

$$[y \;\; Y_1 \;\; Y_2 \;\; X_1 \;\; X_2]\begin{bmatrix} 1 \\ -\beta \\ 0 \\ -\gamma \\ 0 \end{bmatrix} = u$$

where $Y_2$ and $X_2$ are matrices of observations on G − g endogenous and K − k
predetermined variables which are excluded from the equation.
The relations between structural and reduced-form equations are given in Eq.
(11-34), which may be rewritten as

$$\Pi'B' = -\Gamma'$$

The relations holding for the coefficients of the structural equation (11-36) are
then

$$\underset{K \times G}{\Pi'}\;\underset{G \times 1}{\begin{bmatrix} 1 \\ -\beta \\ 0 \end{bmatrix}} = \underset{K \times 1}{\begin{bmatrix} \gamma \\ 0 \end{bmatrix}}$$  (11-37)

Substituting in this from Eq. (11-35) gives the ILS coefficients as the vectors b and
c obtained by solving

$$(X'X)^{-1}X'Y\begin{bmatrix} 1 \\ -b \\ 0 \end{bmatrix} = \begin{bmatrix} c \\ 0 \end{bmatrix}$$  (11-38)

The crucial question is whether there are unique solution vectors b and c.
Rewriting Eq. (11-38) as

$$(X'X)^{-1}X'[y \;\; Y_1 \;\; Y_2]\begin{bmatrix} 1 \\ -b \\ 0 \end{bmatrix} = \begin{bmatrix} c \\ 0 \end{bmatrix}$$

gives

$$(X'X)^{-1}X'y - (X'X)^{-1}X'Y_1 b = \begin{bmatrix} c \\ 0 \end{bmatrix}$$  (11-39)

Premultiplying by (X'X), partitioning X as $[X_1 \;\; X_2]$, and rearranging gives the
pair of equations

$$(X_1'Y_1)b + (X_1'X_1)c = X_1'y$$  (11-40)
$$(X_2'Y_1)b + (X_2'X_1)c = X_2'y$$  (11-41)

Together these constitute K equations in (g − 1) + k unknowns. Since the
necessary condition for exact identification is

$$K - k = g - 1$$

we have the same number of equations as unknowns so that, in general, Eqs.
(11-40) and (11-41) solve uniquely for the ILS estimates b and c.
These equations also indicate how the ILS estimates may be interpreted as IV
estimates. Returning to the structural equation

$$y = Y_1\beta + X_1\gamma + u$$

the inconsistency of OLS arises from the correlations between $Y_1$ and $u$. The $X$
variables, however, are uncorrelated with $u$, and in the exactly identified case $X_2$
will have the same number of columns as $Y_1$. This suggests using

$$[\,X_2 \;\; X_1\,]$$

as the set of instruments for $[\,Y_1 \;\; X_1\,]$. The resultant IV estimates are given by

$$\begin{bmatrix} X_2'Y_1 & X_2'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}\begin{bmatrix} b \\ c \end{bmatrix} = \begin{bmatrix} X_2'y \\ X_1'y \end{bmatrix}$$

which are identical with Eqs. (11-40) and (11-41). Notice that the ordering of the
instrumental variables is unimportant. We can just as well take

$$X = [\,X_1 \;\; X_2\,]$$

as the matrix of instrumental variables. Rewriting Eq. (11-36) as

$$y = Z_1\delta + u$$

where

$$Z_1 = [\,Y_1 \;\; X_1\,] \qquad \text{and} \qquad \delta = \begin{bmatrix} \beta \\ \gamma \end{bmatrix}$$

the IV estimator of $\delta$ is

$$d_{IV} = \begin{bmatrix} b \\ c \end{bmatrix} = (X'Z_1)^{-1}X'y \tag{11-42}$$

which is easily seen to be identical to the ILS estimator defined in Eqs. (11-40)
and (11-41).


Two-Stage Least Squares 


In practice ILS is not a widely used technique since it is rare for an equation to be
exactly identified. 2SLS is perhaps the most important and widely used procedure.
It is applicable to equations which are overidentified or exactly identified.
Moreover, it turns out that in the case of an exactly identified equation the 2SLS
estimates are identical with the ILS estimates given by Eqs. (11-40) and (11-41).

Consider again the estimation of the equation

$$y = Y_1\beta + X_1\gamma + u$$

where the necessary condition for identification requires that

$$K - k \ge g - 1$$


As we have seen, the trouble with applying OLS directly to this equation is that
the embedding of the equation in a simultaneous equation model makes the
variables in $Y_1$ correlated with $u$. The 2SLS technique consists of replacing $Y_1$ by a
computed matrix $\hat{Y}_1$, which hopefully is purged of the stochastic element, and
then performing an OLS regression of $y$ on $\hat{Y}_1$ and $X_1$.

The matrix $\hat{Y}_1$ is computed in the first stage by regressing each variable in $Y_1$
on all the predetermined variables in the complete model and replacing the actual
observations on the $Y_1$ variables by the corresponding regression values. Thus

$$\hat{Y}_1 = X(X'X)^{-1}X'Y_1 \tag{11-43}$$


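In the second stage the regression of $y$ on $\hat{Y}_1$ and $X_1$ yields the estimating
equations

$$\begin{bmatrix} \hat{Y}_1'\hat{Y}_1 & \hat{Y}_1'X_1 \\ X_1'\hat{Y}_1 & X_1'X_1 \end{bmatrix}\begin{bmatrix} b \\ c \end{bmatrix} = \begin{bmatrix} \hat{Y}_1'y \\ X_1'y \end{bmatrix} \tag{11-44}$$

where $\begin{bmatrix} b \\ c \end{bmatrix}$ now denotes the 2SLS estimator of $\begin{bmatrix} \beta \\ \gamma \end{bmatrix}$. For the actual estimation
there is no need to compute the regression values in $\hat{Y}_1$ explicitly. An alternative
form of Eq. (11-44) can be derived which involves only the matrices of actual
observations. The matrix $Y_1$ can be written as

$$Y_1 = \hat{Y}_1 + V_1$$

where $\hat{Y}_1$ is given by Eq. (11-43) and $V_1$ is the $n \times (g-1)$ matrix of OLS
residuals. The usual properties of OLS residuals give

$$\hat{Y}_1'V_1 = 0 \qquad \text{and} \qquad X_1'V_1 = 0$$

Thus

$$\hat{Y}_1'\hat{Y}_1 = \hat{Y}_1'(Y_1 - V_1) = \hat{Y}_1'Y_1 = Y_1'X(X'X)^{-1}X'Y_1$$

and

$$\hat{Y}_1'X_1 = (Y_1 - V_1)'X_1 = Y_1'X_1$$

Thus the equations for the 2SLS estimator can now be written

$$\begin{bmatrix} Y_1'X(X'X)^{-1}X'Y_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}\begin{bmatrix} b \\ c \end{bmatrix} = \begin{bmatrix} Y_1'X(X'X)^{-1}X'y \\ X_1'y \end{bmatrix} \tag{11-45}$$

Yet another form of the 2SLS equations, which is useful for further theoretical
developments, is

$$\begin{bmatrix} Y_1'Y_1 - V_1'V_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}\begin{bmatrix} b \\ c \end{bmatrix} = \begin{bmatrix} Y_1'y - V_1'y \\ X_1'y \end{bmatrix} \tag{11-46}$$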


The equivalence between Eqs. (11-45) and (11-46) may be proved by the reader as
an exercise.
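The equivalence of the three forms of the 2SLS equations can also be checked numerically. The following is a minimal sketch on simulated data; the sample size, coefficient values, random seed, and the helper name `solve_blocks` are all assumptions made for illustration and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, k = 200, 4, 2                     # hypothetical sizes; g - 1 = 1

X = rng.normal(size=(n, K))             # all predetermined variables
X1 = X[:, :k]                           # those included in the equation
u = rng.normal(size=n)
# Y1 depends on every column of X and on u, so it is correlated with u
Y1 = (X @ np.ones(K) + 0.8 * u + rng.normal(size=n))[:, None]
y = 0.5 * Y1[:, 0] + X1 @ np.array([1.0, -1.0]) + u

# First stage, Eq. (11-43): regression values and residuals
Y1_hat = X @ np.linalg.solve(X.T @ X, X.T @ Y1)
V1 = Y1 - Y1_hat

def solve_blocks(a11, a12, a22, r1, r2):
    """Solve [[a11, a12], [a12', a22]] [b; c] = [r1; r2]."""
    lhs = np.block([[a11, a12], [a12.T, a22]])
    return np.linalg.solve(lhs, np.concatenate([r1, r2]))

# Eq. (11-44): second-stage OLS normal equations
d44 = solve_blocks(Y1_hat.T @ Y1_hat, Y1_hat.T @ X1, X1.T @ X1,
                   Y1_hat.T @ y, X1.T @ y)

# Eq. (11-45): actual observations only
proj_Y1 = X @ np.linalg.solve(X.T @ X, X.T @ Y1)   # X(X'X)^{-1}X'Y1
proj_y = X @ np.linalg.solve(X.T @ X, X.T @ y)     # X(X'X)^{-1}X'y
d45 = solve_blocks(Y1.T @ proj_Y1, Y1.T @ X1, X1.T @ X1,
                   Y1.T @ proj_y, X1.T @ y)

# Eq. (11-46): in terms of the first-stage residuals V1
d46 = solve_blocks(Y1.T @ Y1 - V1.T @ V1, Y1.T @ X1, X1.T @ X1,
                   Y1.T @ y - V1.T @ y, X1.T @ y)
```

The three vectors `d44`, `d45`, and `d46` agree up to rounding error, whatever data are used, because the equalities rest only on the orthogonality of OLS residuals to the regressors.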


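Example 11-8 The first structural equation in a three-equation model is

$$y_{1t} = \beta_{12}y_{2t} + \gamma_{11}x_{1t} + \gamma_{12}x_{2t} + u_{1t}$$

There are four predetermined variables in the complete model, and the $X'X$
matrix is

$$X'X = \begin{bmatrix} 10 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix}$$

In addition we are given

$$[\,y_1 \;\; y_2\,]'X = \begin{bmatrix} 2 & 3 & 4 & 1 \\ 1 & 0 & 2 & 1 \end{bmatrix}$$

The necessary condition for identification is satisfied since $K - k = 2$ and
$g - 1 = 1$ so that the equation is overidentified. To estimate the parameters
by 2SLS we need to establish a correspondence between the data in this
problem and the vectors and matrices in Eq. (11-44). Thus

$$y = y_1 \qquad Y_1 = y_2 \qquad X_1 = [\,x_1 \;\; x_2\,] \qquad X_2 = [\,x_3 \;\; x_4\,]$$

$$Y_1'X = [\,1 \;\; 0 \;\; 2 \;\; 1\,] \qquad Y_1'X_1 = [\,1 \;\; 0\,] \qquad X_1'X_1 = \begin{bmatrix} 10 & 0 \\ 0 & 5 \end{bmatrix} \qquad X_1'y = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$$

and so

$$Y_1'X(X'X)^{-1}X'Y_1 = [\,1 \;\; 0 \;\; 2 \;\; 1\,]\begin{bmatrix} 0.1 & 0 & 0 & 0 \\ 0 & 0.2 & 0 & 0 \\ 0 & 0 & 0.25 & 0 \\ 0 & 0 & 0 & 0.5 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \\ 2 \\ 1 \end{bmatrix} = 1.6$$

$$Y_1'X(X'X)^{-1}X'y = [\,0.1 \;\; 0 \;\; 0.5 \;\; 0.5\,]\begin{bmatrix} 2 \\ 3 \\ 4 \\ 1 \end{bmatrix} = 2.7$$

The 2SLS equations are then

$$\begin{bmatrix} 1.6 & 1 & 0 \\ 1 & 10 & 0 \\ 0 & 0 & 5 \end{bmatrix}\begin{bmatrix} b_{12} \\ c_{11} \\ c_{12} \end{bmatrix} = \begin{bmatrix} 2.7 \\ 2 \\ 3 \end{bmatrix}$$

with solution

$$\begin{bmatrix} b_{12} \\ c_{11} \\ c_{12} \end{bmatrix} = \begin{bmatrix} 1.6667 \\ 0.0333 \\ 0.6000 \end{bmatrix}$$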
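The arithmetic of Example 11-8 can be reproduced in a few lines. This sketch takes the moments to be $X'X = \mathrm{diag}(10, 5, 4, 2)$, $X'y = (2, 3, 4, 1)'$, and $X'Y_1 = (1, 0, 2, 1)'$, which is an assumed reading of the printed matrices chosen to agree with the stated solution; the variable names are illustrative only.

```python
import numpy as np

# Assumed reading of the example's sample moments
XtX = np.diag([10.0, 5.0, 4.0, 2.0])              # X'X
Xty = np.array([2.0, 3.0, 4.0, 1.0])              # X'y,  with y  = y1
XtY1 = np.array([[1.0], [0.0], [2.0], [1.0]])     # X'Y1, with Y1 = y2
k = 2                                             # x1 and x2 appear in the equation

a11 = XtY1.T @ np.linalg.solve(XtX, XtY1)         # Y1'X(X'X)^{-1}X'Y1
r1 = XtY1.T @ np.linalg.solve(XtX, Xty)           # Y1'X(X'X)^{-1}X'y

# Left- and right-hand sides of Eq. (11-45)
lhs = np.block([[a11, XtY1[:k].T],
                [XtY1[:k], XtX[:k, :k]]])
rhs = np.concatenate([r1, Xty[:k]])

est = np.linalg.solve(lhs, rhs)                   # [b12, c11, c12]
```

Only the moment matrices are needed; the raw observations never enter, which is why the example can be worked from $X'X$, $X'y$, and $X'Y_1$ alone.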


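Example 11-9 For a model

$$y_{1t} = \beta_{12}y_{2t} + \gamma_{11}x_{1t} + u_{1t}$$
$$y_{2t} = \beta_{21}y_{1t} + \gamma_{22}x_{2t} + \gamma_{23}x_{3t} + u_{2t}$$

the sample matrices are†

$$X'X = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 10 \end{bmatrix} \qquad X'Y = \begin{bmatrix} 5 & 10 \\ 40 & 20 \\ 20 & 30 \end{bmatrix}$$

We will illustrate the application of 2SLS and ILS, as appropriate, to this
model and also look at the estimation of the reduced-form coefficients.

The first equation is overidentified and is estimated by 2SLS. The
correspondence between the variables in the equation and the matrix expressions
in Eq. (11-44) is given by

$$y = y_1 \qquad Y_1 = y_2 \qquad X_1 = x_1 \qquad X_2 = [\,x_2 \;\; x_3\,]$$

Thus

$$X'Y_1 = \begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix} \qquad X_1'Y_1 = 10 \qquad X'y = \begin{bmatrix} 5 \\ 40 \\ 20 \end{bmatrix} \qquad X_1'y = 5 \qquad X_1'X_1 = 1$$

$$Y_1'X(X'X)^{-1}X'Y_1 = [\,10 \;\; 20 \;\; 30\,]\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.05 & 0 \\ 0 & 0 & 0.1 \end{bmatrix}\begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix} = [\,10 \;\; 1 \;\; 3\,]\begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix} = 210$$

$$Y_1'X(X'X)^{-1}X'y = [\,10 \;\; 1 \;\; 3\,]\begin{bmatrix} 5 \\ 40 \\ 20 \end{bmatrix} = 150$$

The 2SLS equations are thus

$$\begin{bmatrix} 210 & 10 \\ 10 & 1 \end{bmatrix}\begin{bmatrix} b_{12} \\ c_{11} \end{bmatrix} = \begin{bmatrix} 150 \\ 5 \end{bmatrix}$$

with solution

$$\begin{bmatrix} b_{12} \\ c_{11} \end{bmatrix} = \begin{bmatrix} 10/11 \\ -45/11 \end{bmatrix}$$

The second equation is just identified and thus may be estimated by
2SLS or ILS. For the 2SLS approach

$$y = y_2 \qquad Y_1 = y_1 \qquad X_1 = [\,x_2 \;\; x_3\,] \qquad X_2 = x_1$$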


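† In this and the previous example the $X'X$ matrices are assumed to be diagonal to keep the
arithmetic simple. In realistic situations orthogonal variables are, of course, very rare.

Thus

$$X'Y_1 = \begin{bmatrix} 5 \\ 40 \\ 20 \end{bmatrix} \qquad X_1'Y_1 = \begin{bmatrix} 40 \\ 20 \end{bmatrix} \qquad X'y = \begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix} \qquad X_1'y = \begin{bmatrix} 20 \\ 30 \end{bmatrix} \qquad X_1'X_1 = \begin{bmatrix} 20 & 0 \\ 0 & 10 \end{bmatrix}$$

$$Y_1'X(X'X)^{-1}X'Y_1 = [\,5 \;\; 40 \;\; 20\,]\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.05 & 0 \\ 0 & 0 & 0.1 \end{bmatrix}\begin{bmatrix} 5 \\ 40 \\ 20 \end{bmatrix} = [\,5 \;\; 2 \;\; 2\,]\begin{bmatrix} 5 \\ 40 \\ 20 \end{bmatrix} = 145$$

$$Y_1'X(X'X)^{-1}X'y = [\,5 \;\; 2 \;\; 2\,]\begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix} = 150$$

The 2SLS equations are thus

$$\begin{bmatrix} 145 & 40 & 20 \\ 40 & 20 & 0 \\ 20 & 0 & 10 \end{bmatrix}\begin{bmatrix} b_{21} \\ c_{22} \\ c_{23} \end{bmatrix} = \begin{bmatrix} 150 \\ 20 \\ 30 \end{bmatrix}$$

with solution

$$\begin{bmatrix} b_{21} \\ c_{22} \\ c_{23} \end{bmatrix} = \begin{bmatrix} 2 \\ -3 \\ -1 \end{bmatrix}$$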


To obtain the ILS estimates of the second equation we need to specify
the additional matrices appearing in Eq. (11-42). These are

$$X_2'Y_1 = 5 \qquad X_2'X_1 = [\,0 \;\; 0\,] \qquad X_2'y = 10$$

The ILS equations are then

$$\begin{bmatrix} 5 & 0 & 0 \\ 40 & 20 & 0 \\ 20 & 0 & 10 \end{bmatrix}\begin{bmatrix} b_{21} \\ c_{22} \\ c_{23} \end{bmatrix} = \begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix}$$

with the same solution vector as 2SLS. This is an illustration of a general
result that 2SLS and ILS estimates, where the latter exist, are identical. The
general result will be proved below, but in the meantime we continue with the
numerical example.
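The 2SLS and ILS calculations of Example 11-9 can be reproduced from the moment matrices alone. The sketch below reads the sample matrices as $X'X = \mathrm{diag}(1, 20, 10)$ and $X'Y$ with rows $(5, 10)$, $(40, 20)$, $(20, 30)$; the function name `tsls_from_moments` and its arguments are hypothetical names introduced here for illustration.

```python
import numpy as np

XtX = np.diag([1.0, 20.0, 10.0])                 # X'X
XtY = np.array([[5.0, 10.0],
                [40.0, 20.0],
                [20.0, 30.0]])                   # X'Y, columns y1 and y2

def tsls_from_moments(XtX, Xty, XtY1, incl):
    """Solve Eq. (11-45); `incl` lists the positions of the included x's."""
    a11 = XtY1.T @ np.linalg.solve(XtX, XtY1)    # Y1'X(X'X)^{-1}X'Y1
    a12 = XtY1[incl].T                           # Y1'X1
    a22 = XtX[np.ix_(incl, incl)]                # X1'X1
    lhs = np.block([[a11, a12], [a12.T, a22]])
    rhs = np.concatenate([XtY1.T @ np.linalg.solve(XtX, Xty), Xty[incl]])
    return np.linalg.solve(lhs, rhs)

# First equation: y = y1, Y1 = y2, X1 = x1 (overidentified)
eq1 = tsls_from_moments(XtX, XtY[:, 0], XtY[:, [1]], [0])
# Second equation: y = y2, Y1 = y1, X1 = [x2 x3] (just identified)
eq2 = tsls_from_moments(XtX, XtY[:, 1], XtY[:, [0]], [1, 2])

# ILS for the second equation via Eq. (11-42): (X'Z1)^{-1} X'y, Z1 = [y1 x2 x3]
XtZ1 = np.column_stack([XtY[:, 0], XtX[:, 1], XtX[:, 2]])
ils2 = np.linalg.solve(XtZ1, XtY[:, 1])

# Unrestricted reduced-form coefficients, Eq. (11-35)
P_t = np.linalg.solve(XtX, XtY)
```

Here `eq2` and `ils2` coincide, as the text asserts for a just identified equation, while `eq1` has no ILS counterpart because the first equation is overidentified.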


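The reduced-form coefficients, estimated by OLS, are

$$P' = (X'X)^{-1}X'Y = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.05 & 0 \\ 0 & 0 & 0.1 \end{bmatrix}\begin{bmatrix} 5 & 10 \\ 40 & 20 \\ 20 & 30 \end{bmatrix} = \begin{bmatrix} 5 & 10 \\ 2 & 1 \\ 2 & 3 \end{bmatrix}$$

giving

$$y_{1t} = 5x_{1t} + 2x_{2t} + 2x_{3t} + \hat{v}_{1t}$$

and

$$y_{2t} = 10x_{1t} + x_{2t} + 3x_{3t} + \hat{v}_{2t}$$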

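The reduced-form matrix may also be estimated by substituting the estimated
structural coefficients $\hat{B}$ and $\hat{\Gamma}$ in Eq. (11-19),

$$\hat{\Pi} = -\hat{B}^{-1}\hat{\Gamma}$$

However, care must be taken in making this substitution since Eq. (11-19)
was derived from the structural equations specified as $By_t + \Gamma x_t = u_t$, whereas
the equations of this model have been specified with just a single endogenous
variable on the left-hand side of each equation. The 2SLS estimates of the
structure are

$$y_{1t} = \tfrac{10}{11}y_{2t} - \tfrac{45}{11}x_{1t} + \hat{u}_{1t}$$
$$y_{2t} = 2y_{1t} - 3x_{2t} - x_{3t} + \hat{u}_{2t}$$

Rearranging with all variables on the left-hand side gives

$$\begin{bmatrix} 1 & -\tfrac{10}{11} \\ -2 & 1 \end{bmatrix}\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} + \begin{bmatrix} \tfrac{45}{11} & 0 & 0 \\ 0 & 3 & 1 \end{bmatrix}\begin{bmatrix} x_{1t} \\ x_{2t} \\ x_{3t} \end{bmatrix} = \begin{bmatrix} \hat{u}_{1t} \\ \hat{u}_{2t} \end{bmatrix}$$

Thus

$$\hat{\Pi} = -\begin{bmatrix} 1 & -\tfrac{10}{11} \\ -2 & 1 \end{bmatrix}^{-1}\begin{bmatrix} \tfrac{45}{11} & 0 & 0 \\ 0 & 3 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 3\tfrac{1}{3} & 1\tfrac{1}{9} \\ 10 & 3\tfrac{2}{3} & 1\tfrac{2}{9} \end{bmatrix}$$

These are somewhat different from the OLS estimates. The reason is that the
OLS estimates are unrestricted and thus fail to satisfy the restrictions placed
on the reduced-form parameters by the overidentification in the system. With
two endogenous and three predetermined variables there are six reduced-form
coefficients, which are functions of just five structural coefficients. The true
reduced-form matrix is

$$\Pi = \frac{1}{1 - \beta_{12}\beta_{21}}\begin{bmatrix} \gamma_{11} & \beta_{12}\gamma_{22} & \beta_{12}\gamma_{23} \\ \beta_{21}\gamma_{11} & \gamma_{22} & \gamma_{23} \end{bmatrix}$$

in which the second and third columns are linearly dependent.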
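The restricted reduced form of Example 11-9 can be checked numerically. The sketch below uses the 2SLS structural estimates in the $By_t + \Gamma x_t = u_t$ convention, with $b_{12} = 10/11$ and $c_{11} = -45/11$ obtained by solving the example's 2SLS equations; the array names are illustrative assumptions.

```python
import numpy as np

# 2SLS structural estimates written as B y_t + Gamma x_t = u_t
B_hat = np.array([[1.0, -10.0 / 11.0],
                  [-2.0, 1.0]])
G_hat = np.array([[45.0 / 11.0, 0.0, 0.0],
                  [0.0, 3.0, 1.0]])

# Restricted reduced form, Pi_hat = -B_hat^{-1} Gamma_hat  (rows: y1, y2)
Pi_hat = -np.linalg.solve(B_hat, G_hat)

# Unrestricted OLS reduced form from the example, transposed to rows y1, y2
P = np.array([[5.0, 2.0, 2.0],
              [10.0, 1.0, 3.0]])

# Overidentifying restriction: columns 2 and 3 of Pi are proportional
rank_restricted = np.linalg.matrix_rank(Pi_hat[:, 1:])
rank_unrestricted = np.linalg.matrix_rank(P[:, 1:])
```

The last two columns of `Pi_hat` have rank 1, whereas the corresponding columns of the OLS estimate `P` have rank 2, which is the sense in which OLS ignores the restriction implied by overidentification.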


Interpretation of Two-Stage Least Squares as an Instrumental Variable Estimator

The structural equation to be estimated may be written as

$$y = Y_1\beta + X_1\gamma + u = Z_1\delta + u \tag{11-47}$$

where

$$Z_1 = [\,Y_1 \;\; X_1\,] \qquad \text{and} \qquad \delta = \begin{bmatrix} \beta \\ \gamma \end{bmatrix}$$

Let us recapitulate the discussion of IV estimates in Chap. 9, with the vector of
unknown parameters now indicated by $\delta$ rather than $\beta$ and with the matrix of
explanatory variables simply indicated by $Z$. The equation to be estimated is
$y = Z\delta + u$, the problem being that

$$\operatorname{plim}\left(\frac{1}{n}Z'u\right) \ne 0$$

which is the difficulty with $Z_1$ in Eq. (11-47). Provided a matrix $W$ can be found
such that

1. $\operatorname{plim}\left(\frac{1}{n}W'W\right) = \Sigma_{WW}$, a finite symmetric positive definite matrix
2. $\operatorname{plim}\left(\frac{1}{n}W'Z\right) = \Sigma_{WZ}$, a finite nonsingular matrix
3. $\operatorname{plim}\left(\frac{1}{n}W'u\right) = 0$

the IV estimator

$$d_{IV} = (W'Z)^{-1}W'y \tag{11-48}$$

will be consistent and will have an asymptotic variance matrix estimated by

$$\text{asy var}(d_{IV}) = s^2(W'Z)^{-1}(W'W)(Z'W)^{-1} \tag{11-49}$$

where

$$s^2 = \frac{(y - Zd_{IV})'(y - Zd_{IV})}{n}$$


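In the present case let us set

$$Z = Z_1 = [\,Y_1 \;\; X_1\,] \qquad \text{and} \qquad W = [\,\hat{Y}_1 \;\; X_1\,]$$

so that $\hat{Y}_1$ is the set of instruments for $Y_1$. The IV estimator defined in Eq. (11-48)
is then

$$\begin{bmatrix} \hat{Y}_1'Y_1 & \hat{Y}_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}\begin{bmatrix} b_{IV} \\ c_{IV} \end{bmatrix} = \begin{bmatrix} \hat{Y}_1'y \\ X_1'y \end{bmatrix} \tag{11-50}$$

but we have already seen that $\hat{Y}_1'\hat{Y}_1 = \hat{Y}_1'Y_1$ and $\hat{Y}_1'X_1 = Y_1'X_1$. Thus Eqs. (11-50)
and (11-44) are identical, so that 2SLS is in fact an IV estimator with $\hat{Y}_1$ as the
instruments for $Y_1$.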


The consistency of the 2SLS (IV) estimator requires the three conditions on
$W$, stated above, to be fulfilled. We will assume that

$$\operatorname{plim}\left(\frac{1}{n}W'W\right) \qquad \text{and} \qquad \operatorname{plim}\left(\frac{1}{n}W'Z\right)$$

are both finite. The third condition is

$$\operatorname{plim}\left(\frac{1}{n}W'u\right) = \begin{bmatrix} \operatorname{plim}\left(\frac{1}{n}\hat{Y}_1'u\right) \\ \operatorname{plim}\left(\frac{1}{n}X_1'u\right) \end{bmatrix} = 0$$

Insofar as $X_1$ contains exogenous variables, whether current or lagged, these are,
by assumption, uncorrelated in the limit with the equation disturbance. The same
result will also hold for any lagged endogenous variables in $X_1$ provided the
disturbance term is serially uncorrelated. The remaining term is

$$\operatorname{plim}\left(\frac{1}{n}\hat{Y}_1'u\right) = \operatorname{plim}\left[\frac{1}{n}Y_1'X(X'X)^{-1}X'u\right] = \operatorname{plim}\left(\frac{1}{n}Y_1'X\right)\cdot\operatorname{plim}\left(\frac{1}{n}X'X\right)^{-1}\cdot\operatorname{plim}\left(\frac{1}{n}X'u\right) = 0$$

since the first two terms are finite and the last is the zero vector.
It was also shown in Sec. 9-2 that the IV estimators are asymptotically
normally distributed with an asymptotic variance matrix estimated by Eq. (11-49).
Substituting for $W$ and $Z$ and using the fact that $\hat{Y}_1'\hat{Y}_1 = \hat{Y}_1'Y_1$ and $\hat{Y}_1'X_1 = Y_1'X_1$
gives

$$\text{asy var}\begin{bmatrix} b \\ c \end{bmatrix} = s^2\begin{bmatrix} \hat{Y}_1'\hat{Y}_1 & \hat{Y}_1'X_1 \\ X_1'\hat{Y}_1 & X_1'X_1 \end{bmatrix}^{-1} = s^2\begin{bmatrix} Y_1'X(X'X)^{-1}X'Y_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}^{-1} \tag{11-51}$$

where

$$s^2 = \frac{(y - Y_1b - X_1c)'(y - Y_1b - X_1c)}{n} \tag{11-52}$$

which is a consistent estimator of $\sigma^2$. Some authors prefer to use the number of
degrees of freedom $n - g - k + 1$ as the divisor in $s^2$ rather than $n$. This is also a
consistent estimator of $\sigma^2$. The 2SLS estimators are thus consistent and asymptotically
normally distributed with estimated variance matrix given in Eq. (11-51).
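The IV form of 2SLS and its estimated asymptotic variance matrix can be sketched as follows. The simulated data, the seed, the true coefficient values, and the function name `tsls_with_variance` are assumptions made purely for illustration; the divisor $n$ is used in $s^2$, as in Eq. (11-52).

```python
import numpy as np

def tsls_with_variance(y, Y1, X1, X):
    """2SLS as an IV estimator, Eq. (11-48), with variance from Eq. (11-51)."""
    Z1 = np.column_stack([Y1, X1])
    Y1_hat = X @ np.linalg.solve(X.T @ X, X.T @ Y1)   # first stage, Eq. (11-43)
    W = np.column_stack([Y1_hat, X1])                 # instrument matrix
    d = np.linalg.solve(W.T @ Z1, W.T @ y)
    resid = y - Z1 @ d                                # residuals use actual Y1
    s2 = resid @ resid / len(y)                       # Eq. (11-52), divisor n
    avar = s2 * np.linalg.inv(W.T @ W)                # Eq. (11-51)
    return d, s2, avar, W, Z1

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 4))
X1 = X[:, :2]
u = rng.normal(size=n)                                # sigma^2 = 1 by construction
Y1 = (X @ np.ones(4) + 0.8 * u + rng.normal(size=n))[:, None]
delta_true = np.array([0.5, 1.0, -1.0])
y = np.column_stack([Y1, X1]) @ delta_true + u

d, s2, avar, W, Z1 = tsls_with_variance(y, Y1, X1, X)
std_err = np.sqrt(np.diag(avar))

# The general IV formula (11-49) collapses to (11-51) because W'Z1 = W'W
sandwich = s2 * np.linalg.inv(W.T @ Z1) @ (W.T @ W) @ np.linalg.inv(Z1.T @ W)
```

Note that the residuals are formed with the actual $Y_1$, not with $\hat{Y}_1$; using the second-stage OLS residuals instead would give a different, and inappropriate, estimate of $\sigma^2$.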
A problem sometimes arises in the application of 2SLS to equations in
medium-size or large-size econometric models. The difficulty is that the number of
predetermined variables in such a model may become large in relation to the
number of observation points. Suppose, to consider a special case, that the
number of predetermined variables becomes as great as the number of observations,
$K = n$. The $X$ matrix is then square and, in the absence of any exact linear
relations between the predetermined variables, nonsingular. Formula (11-43) thus
reduces to

$$\hat{Y}_1 = X(X'X)^{-1}X'Y_1 = XX^{-1}(X')^{-1}X'Y_1 = Y_1$$

and 2SLS is equivalent to OLS. The 2SLS estimates would, of course, no longer
be consistent, since the matrix of instrumental variables is now $W = [\,Y_1 \;\; X_1\,]$ and

$$\operatorname{plim}\left(\frac{1}{n}Y_1'u\right) \ne 0 \qquad \text{so that} \qquad \operatorname{plim}\left(\frac{1}{n}W'u\right) \ne 0$$

whereas a zero probability limit was required for consistency.

† The detailed conditions for this to be true are set out in H. Theil, Principles of Econometrics,
Wiley, New York, 1971, pp. 484-488.

When $K > n$, the $X'X$ matrix is of order $K \times K$ and of rank $n$. Thus it is
singular, and the inverse $(X'X)^{-1}$ does not exist. This has often led to the
conclusion that the 2SLS estimator will not exist, since Eq. (11-45), for example,
involves $(X'X)^{-1}$. Fisher and Wadycki have pointed out that this is not necessarily
the case.† They argue that the $\hat{Y}_1$ matrix will be unique in spite of the multiplicity
of solutions for the reduced-form coefficients. Consider, for instance, the first
variable in $Y_1$ and denote the $n \times 1$ vector of observations on that variable by $y_1$.
Letting $p$ denote the $K \times 1$ vector of OLS reduced-form coefficients for that
variable, the usual formula gives

$$(X'X)p = X'y_1 \tag{11-53}$$

Since $X'X$ is of order $K \times K$ with rank $n < K$, Eq. (11-53) has an infinity of
solutions. Letting $p_1$ and $p_2$ be any two solution vectors, we have

$$(X'X)p_1 = X'y_1$$
$$(X'X)p_2 = X'y_1$$

Thus

$$(X'X)(p_1 - p_2) = 0$$

Premultiplying by $(p_1 - p_2)'$ gives

$$(p_1 - p_2)'(X'X)(p_1 - p_2) = 0$$

Thus

$$X(p_1 - p_2) = 0$$

so that

$$\hat{y}_1 = Xp_1 = Xp_2$$

Moreover Eq. (11-53) may be rewritten as

$$X'(Xp - y_1) = 0$$

where $X'$ is $K \times n$ and $(Xp - y_1)$ is $n \times 1$. Since $X'$ has rank $n$ $(< K)$, the only solution vector is $Xp - y_1 = 0$ so that
$\hat{y}_1 = y_1$. The same result will hold for each variable in $Y_1$ so that once again
$\hat{Y}_1 = Y_1$, and 2SLS would be equivalent to OLS.

† W. D. Fisher and W. J. Wadycki, "Estimating a Structural Equation in a Large System,"
Econometrica, vol. 39, 1971, pp. 461-465.
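The Fisher-Wadycki argument can be illustrated numerically: when $K > n$ the normal equations have many solutions, yet every solution reproduces $y_1$ exactly. The sketch below uses arbitrary simulated data; the sizes, seed, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 8, 12                              # more predetermined variables than observations
X = rng.normal(size=(n, K))               # rank n with probability one
y1 = rng.normal(size=n)

# One solution of (X'X)p = X'y1: the minimum-norm least-squares solution
p_min, *_ = np.linalg.lstsq(X, y1, rcond=None)

# A second solution: add any vector from the null space of X.
# With full_matrices=True the last rows of Vh span that null space.
null_vec = np.linalg.svd(X)[2][-1]
p_alt = p_min + 3.0 * null_vec

fitted_min = X @ p_min
fitted_alt = X @ p_alt
```

Both `p_min` and `p_alt` satisfy Eq. (11-53), they differ from each other, and both fitted vectors equal `y1`, so the first stage simply returns the original observations and the second stage is ordinary least squares.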

Various suggestions have been made for dealing with the problem of an
excess of predetermined variables. Kloek and Mennes suggested replacing $X_2$ in
the first-stage regressions by a smaller number of principal components.† Let $F$
denote the $n \times l$ matrix of the $l$ chosen principal components and then define

$$Z = [\,X_1 \;\; F\,]$$

This $Z$ matrix takes the place of the $X$ matrix in Eq. (11-45), and the 2SLS
estimates based on the principal components approach would then be given by

$$\begin{bmatrix} Y_1'Z(Z'Z)^{-1}Z'Y_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}\begin{bmatrix} b_{PC} \\ c_{PC} \end{bmatrix} = \begin{bmatrix} Y_1'Z(Z'Z)^{-1}Z'y \\ X_1'y \end{bmatrix} \tag{11-54}$$

Various problems arise with this approach. The first concerns the number $l$ of
principal components to be used. Kloek and Mennes state that identification
requires

$$l \ge g - 1$$

but it is difficult to see the reason for this condition since the problem is to find a
suitable matrix $Z$ for the first-stage regressions in which $Y_1$ is replaced by an
estimated matrix $\hat{Y}_1$. It is, of course, true that identification of the structural
equation requires that the number of columns in $X_2$, namely, $K - k$, should be at
least equal to $g - 1$, but there is no reason to carry this condition over to the
choice of variables used in computing $\hat{Y}_1$.

A second problem concerns the criterion to be used in selecting principal
components. One possibility is to choose the components with the greatest
eigenvalues, that is, the components which account for the greatest variance of the
variables in $X_2$. Some of these components, however, may be highly correlated
with variables in $X_1$, thus providing little additional assistance in explaining $Y_1$
and possibly also causing $Z'Z$ to be nearly singular, so that numerical difficulties
arise in computing the inverse. Kloek and Mennes have suggested components
which have the least correlation with the $X_1$ matrix. Both approaches involve
substantial computation and also imply different sets of principal components for
different structural equations.

The last difficulty is avoided by calculating, once and for all, principal
components of the complete set of predetermined variables and using a subset of
these in the first-stage regressions for each structural equation. In a very interesting
study Klein estimated a revised version of the Klein-Goldberger model of the
U.S. economy by using just the principal components corresponding (1) to
the four largest and (2) to the eight largest eigenvalues of $X'X$ in the first stage of
the 2SLS procedure. Comparing the predictions of GNP in the sample period
from these two estimators with OLS and FIML, Klein found 2SLS based on just
four principal components to give the smallest absolute percentage error followed
by the other 2SLS estimator, OLS, and FIML in that order.†

† T. Kloek and L. B. M. Mennes, "Simultaneous Equation Estimation Based on Principal
Components of Predetermined Variables," Econometrica, vol. 28, 1960, pp. 45-61. For a review of
principal components see App. A-10.

An alternative approach based on instrumental variables has been suggested
by Brundy and Jorgenson to bypass the substantial computation involved in
calculating the reduced-form coefficients required for $\hat{Y}_1$.‡ Let

$$E(Y_1) = X\Pi_1$$

where $\Pi_1$ is the $K \times (g - 1)$ submatrix of reduced-form coefficients relevant to
the variables in $Y_1$. The Brundy-Jorgenson suggestion is as follows.

1. Define a matrix of instrumental variables as

$$W_1 = [\,X\hat{\Pi}_1 \;\; X_1\,] \tag{11-55}$$

where $\hat{\Pi}_1$ is any consistent estimator of $\Pi_1$.
2. Then compute the structural coefficient estimator from the IV formula as

$$d = \begin{bmatrix} b \\ c \end{bmatrix} = (W_1'Z_1)^{-1}W_1'y \tag{11-56}$$

where $Z_1 = [\,Y_1 \;\; X_1\,]$.

The regular 2SLS estimator satisfies these conditions, for $\hat{\Pi}_1 = (X'X)^{-1}X'Y_1$
is a consistent estimator of $\Pi_1$ and $W_1$ then becomes $[\,\hat{Y}_1 \;\; X_1\,]$. The novelty of the
Brundy-Jorgenson approach is to avoid computing reduced-form coefficients and
to derive an appropriate $\hat{\Pi}_1$ by first obtaining $\hat{B}$ and $\hat{\Gamma}$ as consistent estimators of
$B$ and $\Gamma$ and then using

$$\hat{\Pi} = -\hat{B}^{-1}\hat{\Gamma}$$

from which the relevant submatrix $\hat{\Pi}_1$ can be extracted and $X\hat{\Pi}_1$ computed for
insertion in Eq. (11-55). Thus even if one is interested in just a single structural
equation, this approach requires the initial computation of consistent estimators
of all structural coefficients. On the other hand, if one is estimating all the
equations of a model, the single $\hat{\Pi}$ matrix is used to provide all relevant $\hat{\Pi}_1$
submatrices.

Several suggestions are offered for initial consistent estimation of the $B$ and
$\Gamma$ matrices, all of them essentially IV estimators. Considering Eq. (11-47) again, the
matrix of right-hand side variables is

$$Z_1 = [\,Y_1 \;\; X_1\,]$$

where $Y_1$ is $n \times (g - 1)$ and $X_1$ is $n \times k$. Define

$$W_1^* = [\,X_1^* \;\; X_1\,]$$

where $X_1^*$ is the matrix of any $g - 1$ predetermined variables which do not appear
in the first structural equation. These variables could be chosen from the
predetermined variables appearing in the structural equations for $Y_1$, as suggested
by Fisher.† The resultant IV estimator of $\delta$ is $(W_1^{*\prime}Z_1)^{-1}W_1^{*\prime}y$. Repeating this
procedure for each structural equation yields the preliminary consistent estimators
$\hat{B}$ and $\hat{\Gamma}$ for insertion in $\hat{\Pi} = -\hat{B}^{-1}\hat{\Gamma}$, and the computations outlined in Eqs.
(11-55) and (11-56) would then yield the final estimator. Another possibility is to
define

$$W_1^{**} = [\,F_1 \;\; X_1\,]$$

where $F_1$ is a subset of $g - 1$ principal components of $X$. This differs, of course,
from the Kloek and Mennes procedure, where the principal components were
used in quasi-reduced-form estimation to compute $\hat{Y}_1$. Here the principal components
are used as instrumental variables in a first-round estimation of structural
coefficients. The Brundy-Jorgenson estimator is known as the limited-information
instrumental variables efficient (LIVE) estimator. The asymptotic variance-covariance
matrix for $d$ is estimated by

$$\text{asy var}(d) = s^2(W_1'W_1)^{-1} \tag{11-57}$$

where $W_1$ is defined in Eq. (11-55) and

$$s^2 = \frac{(y - Z_1d)'(y - Z_1d)}{n}$$

The LIVE estimates can thus be computed even where the 2SLS estimates cannot,
but the actual point estimates will, of course, vary with the variables chosen as
instruments.

† L. R. Klein, "Estimation of Interdependent Systems in Macroeconometrics," Econometrica, vol.
37, 1969, pp. 171-192.

‡ J. M. Brundy and D. W. Jorgenson, "Efficient Estimation of Simultaneous Equations by
Instrumental Variables," Review of Economics and Statistics, vol. 53, 1971, pp. 207-224.


Limited-Information Maximum Likelihood (Least Variance Ratio) Estimators
This alternative approach to the estimation of a structural equation preceded the
development of 2SLS, which has largely replaced it on grounds of greater
simplicity. Consider again the structural equation

$$y = Y_1\beta + X_1\gamma + u$$

and rewrite it as

$$Y_\Delta\beta_\Delta - X_1\gamma = u \tag{11-58}$$

where

$$Y_\Delta = [\,y \;\; Y_1\,] \qquad \text{and} \qquad \beta_\Delta = \begin{bmatrix} 1 \\ -\beta \end{bmatrix} \tag{11-59}$$

† F. M. Fisher, "Dynamic Structure and Estimation in Economy-Wide Econometric Models," in
J. Duesenberry, G. Fromm, L. R. Klein, and E. Kuh, Eds., The Brookings Quarterly Econometric
Model of the United States, Rand-McNally, Skokie, IL, 1965, pp. 589-636.

Let us suppose that the endogenous variables have been so numbered that $Y_\Delta$
constitutes the first $g$ such variables and likewise that $X_1$ refers to the first $k$
predetermined variables. The likelihood function for the endogenous variables in
$Y_\Delta$ will involve the parameters in the first $g$ rows of the reduced-form matrix $\Pi$.
Let these rows be partitioned into the two submatrices $[\,\Pi_{\Delta 1} \;\; \Pi_{\Delta 2}\,]$ which are of
order $g \times k$ and $g \times (K - k)$, respectively. We know that

$$B\Pi = -\Gamma$$

The first row of each side of this equation may be written

$$[\,\beta_\Delta' \;\; 0_1\,]\Pi = [\,\gamma' \;\; 0_2\,]$$

where $0_1$ indicates a row vector of $G - g$ zeros and $0_2$ a row vector of $K - k$
zeros. Using the partitioning of $\Pi$ then gives

$$\beta_\Delta'\Pi_{\Delta 1} = \gamma' \tag{11-60}$$
$$\beta_\Delta'\Pi_{\Delta 2} = 0_2 \tag{11-61}$$

Eq. (11-61) constitutes $K - k$ homogeneous equations in the $g$ elements of $\beta_\Delta$.
However, one of the $\beta$'s has been set at unity so that we merely need to determine
the ratios of the elements in $\beta_\Delta$. This can be done uniquely if the rank of $\Pi_{\Delta 2}$ is
$g - 1$. Even in the overidentified case where $K - k > g - 1$ and $\Pi_{\Delta 2}$ thus has $g$
rows and at least $g$ columns, the rank of $\Pi_{\Delta 2}$ cannot exceed $g - 1$.† This is
obvious intuitively since Eq. (11-61) is just a subset of equations from $B\Pi = -\Gamma$,
which gives the relations between the true structural coefficients and the true
reduced-form coefficients. However, the true $\Pi_{\Delta 2}$ is unknown, and when it is
replaced in Eq. (11-61) by, say, the ML estimate $\hat{\Pi}_{\Delta 2}$, this matrix in the
overidentified case will almost certainly have rank $g$ so that one cannot solve for
nonzero $\beta_\Delta$ except by arbitrarily dropping one of the equations.

The limited-information maximum likelihood (LIML) approach is to maximize
the likelihood function for the $g$ endogenous variables in $Y_\Delta$ subject to the
restriction that $\rho(\Pi_{\Delta 2}) = g - 1$. This approach was developed by Anderson and
Rubin.‡ The application of the method requires one to know, in addition to the
specification of the equation being estimated, merely the predetermined variables
appearing in the other equations of the model, as in 2SLS. The mathematical
development of the LIML estimator is complicated and lengthy, but it may be
shown that it reduces to the choice of the elements of $\beta_\Delta$ to minimize

$$l = \frac{\beta_\Delta'W_{\Delta\Delta}^*\beta_\Delta}{\beta_\Delta'W_{\Delta\Delta}\beta_\Delta} \tag{11-62}$$

where $W_{\Delta\Delta}^*$ and $W_{\Delta\Delta}$ are certain matrices of residuals.† The explanation of these
residuals is given in the following account of least variance ratio (LVR) estimators.

† W. C. Hood and T. C. Koopmans, Studies in Econometric Method, Wiley, New York, 1953.

‡ T. W. Anderson and H. Rubin, "Estimation of the Parameters of a Single Equation in a
Complete System of Stochastic Equations," Annals of Mathematical Statistics, vol. 20, pp. 46-63, 1949.
Rewrite Eq. (11-58) as

$$z = X_1\gamma + u$$

where

$$z = Y_\Delta\beta_\Delta$$

so that the $z$ vector is a linear combination of the endogenous variables appearing
in the equation, the coefficients of the combination being the unknown $\beta$
parameters. If $z$ is regressed on $X_1$, the residual sum of squares is

$$z'z - z'X_1(X_1'X_1)^{-1}X_1'z = \beta_\Delta'Y_\Delta'Y_\Delta\beta_\Delta - \beta_\Delta'Y_\Delta'X_1(X_1'X_1)^{-1}X_1'Y_\Delta\beta_\Delta = \beta_\Delta'W_{\Delta\Delta}^*\beta_\Delta$$

where

$$W_{\Delta\Delta}^* = Y_\Delta'Y_\Delta - Y_\Delta'X_1(X_1'X_1)^{-1}X_1'Y_\Delta \tag{11-63}$$

Similarly, if $z$ is regressed on all the predetermined variables, $X = [\,X_1 \;\; X_2\,]$, the
residual sum of squares is

$$\beta_\Delta'W_{\Delta\Delta}\beta_\Delta$$

where

$$W_{\Delta\Delta} = Y_\Delta'Y_\Delta - Y_\Delta'X(X'X)^{-1}X'Y_\Delta \tag{11-64}$$

The second residual sum of squares will be no greater than the first since the
second regression includes all the explanatory variables in the first regression $X_1$
plus the set $X_2$. However, the specification of the structural equation asserts that $z$
depends on $X_1$ but not on $X_2$. Thus the LVR principle suggests that the estimate
of $\beta_\Delta$ should be chosen to keep this reduction in the residual sum of squares as
small as possible, that is, to minimize the ratio

$$l = \frac{\beta_\Delta'W_{\Delta\Delta}^*\beta_\Delta}{\beta_\Delta'W_{\Delta\Delta}\beta_\Delta}$$

which is the same criterion as that for the LIML estimator. Differentiating $l$ with
respect to $\beta_\Delta$ and setting the result equal to the zero vector gives

$$(W_{\Delta\Delta}^* - lW_{\Delta\Delta})\beta_\Delta = 0 \tag{11-65}$$


This set of equations will only have a nontrivial solution if the determinantal
equation

$$|W_{\Delta\Delta}^* - lW_{\Delta\Delta}| = 0$$

is satisfied. This gives a polynomial in $l$, which must be solved for the smallest
root $\hat{l}$. This root is substituted back in Eq. (11-65) and the estimator $\hat{\beta}_\Delta$ obtained
from

$$(W_{\Delta\Delta}^* - \hat{l}\,W_{\Delta\Delta})\hat{\beta}_\Delta = 0 \tag{11-66}$$

by setting the first element of $\hat{\beta}_\Delta$ equal to unity. Defining

$$\hat{z} = Y_\Delta\hat{\beta}_\Delta$$

and regressing $\hat{z}$ on $X_1$ gives

$$\hat{\gamma} = (X_1'X_1)^{-1}X_1'Y_\Delta\hat{\beta}_\Delta \tag{11-67}$$

Equations (11-66) and (11-67) define the LIML estimates of the structural
equation. The LIML estimators have the same asymptotic variance-covariance
matrix as 2SLS. The estimates of the asymptotic variances, however, will differ
since $s^2$ is computed from the estimated structural coefficients, which will be
different in the two cases.

† T. W. Anderson and H. Rubin, op. cit.; see also W. C. Hood and T. C. Koopmans, op. cit.,
Chap. 6. Hood and Koopmans arrive at Eq. (11-62) by a different method from the original approach
of Anderson and Rubin, who maximized the likelihood function subject to appropriate constraints by
using Lagrange multipliers. Hood and Koopmans start with the likelihood function for the complete
model of G equations for all G endogenous variables and then, by a series of stepwise maximizations,
eliminate from the likelihood function all parameters other than those of the equation to be estimated.
Finally, even $\gamma$ is eliminated and the concentrated likelihood function expressed in terms of $\beta_\Delta$.
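The least variance ratio calculation can be sketched in a few lines. The roots of the determinantal equation are the eigenvalues of $W_{\Delta\Delta}^{-1}W_{\Delta\Delta}^*$, so the smallest eigenvalue and its eigenvector give $\hat{l}$ and $\hat{\beta}_\Delta$. The function name `liml`, the simulated data, and the seed are assumptions for illustration; the data are generated for an exactly identified equation, where the smallest root equals 1 and the estimates coincide with ILS.

```python
import numpy as np

def liml(y, Y1, X1, X):
    """LIML / least variance ratio estimates via Eqs. (11-63) to (11-67)."""
    Yd = np.column_stack([y, Y1])                     # Y_Delta = [y  Y1]

    def resid_moment(A):
        # Y_Delta' Y_Delta - Y_Delta' A (A'A)^{-1} A' Y_Delta
        return Yd.T @ Yd - Yd.T @ A @ np.linalg.solve(A.T @ A, A.T @ Yd)

    W_star = resid_moment(X1)                         # Eq. (11-63)
    W = resid_moment(X)                               # Eq. (11-64)
    # Roots of |W* - l W| = 0 are the eigenvalues of W^{-1} W*
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, W_star))
    j = np.argmin(eigvals.real)
    l_hat = eigvals.real[j]
    beta_d = eigvecs[:, j].real
    beta_d = beta_d / beta_d[0]                       # first element set to unity
    gamma = np.linalg.solve(X1.T @ X1, X1.T @ Yd @ beta_d)   # Eq. (11-67)
    return l_hat, -beta_d[1:], gamma                  # beta_Delta = [1, -beta]'

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))
X1 = X[:, :2]                                         # K - k = 1 = g - 1
u = rng.normal(size=n)
Y1 = (X @ np.ones(3) + 0.7 * u + rng.normal(size=n))[:, None]
y = 0.5 * Y1[:, 0] + X1 @ np.array([1.0, -1.0]) + u

l_hat, b_liml, c_liml = liml(y, Y1, X1, X)

# ILS / IV estimate from Eq. (11-42) for comparison
Z1 = np.column_stack([Y1, X1])
d_iv = np.linalg.solve(X.T @ Z1, X.T @ y)
```

In an overidentified equation the same function applies unchanged; the smallest root then exceeds 1 in general, and the LIML and 2SLS point estimates differ.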


Three-Stage Least Squares and Full-Information Maximum Likelihood 


The estimators considered so far, namely ILS, 2SLS, LIVE, and LIML, are all
essentially limited-information estimators in that in the estimation of any structural
equation complete information on all the other structural equations in the
model is not taken into account.† In principle information on the complete
structure, if correct, will yield estimators with greater asymptotic efficiency than
that attainable by limited-information methods. There are two main full-information
methods, namely, three-stage least squares (3SLS) and full-information
maximum likelihood (FIML).

The initial development of 3SLS is due to Zellner and Theil.‡ Consider again
the general linear model containing $G$ jointly dependent endogenous variables
and $K$ predetermined variables. The $i$th equation may be written

$$y_i = Y_i\beta_i + X_i\gamma_i + u_i \tag{11-68}$$

where $y_i$ is an $n \times 1$ vector of sample observations on the dependent variable in
the $i$th equation, $Y_i$ is an $n \times (g_i - 1)$ matrix of observations on the other endogenous
variables in the equation, $X_i$ is an $n \times k_i$ matrix of observations on the predetermined
variables in the equation, $\beta_i$ and $\gamma_i$ are vectors of structural parameters,
and $u_i$ is a vector of disturbances. Rewrite Eq. (11-68) as

$$y_i = Z_i\delta_i + u_i \tag{11-69}$$

where

$$Z_i = [\,Y_i \;\; X_i\,] \qquad \text{and} \qquad \delta_i = \begin{bmatrix} \beta_i \\ \gamma_i \end{bmatrix}$$

If Eq. (11-69) is premultiplied by $X'$, where $X$ is the $n \times K$ matrix of all the predetermined
variables in the model, then

$$X'y_i = X'Z_i\delta_i + X'u_i \qquad i = 1, \ldots, G \tag{11-70}$$

† An exception is the LIVE estimator where initial estimates of the $B$ and $\Gamma$ matrices are made to
derive an estimate of $\Pi$.

‡ A. Zellner and H. Theil, "Three Stage Least Squares: Simultaneous Estimation of Simultaneous
Equations," Econometrica, vol. 30, 1962, pp. 54-78.

The variance-covariance matrix of the disturbance term in Eq. (11-70) is

$$E(X'u_iu_i'X) = \sigma_{ii}X'X \tag{11-71}$$

on the assumption that $E(u_iu_i') = \sigma_{ii}I$. Considering Eq. (11-70) as a relationship
between a dependent variable $X'y_i$ and explanatory variables $X'Z_i$, the nonspherical
disturbance matrix in Eq. (11-71) suggests using generalized least squares. The
GLS estimator of $\delta_i$ is then

$$d_i = [Z_i'X(X'X)^{-1}X'Z_i]^{-1}Z_i'X(X'X)^{-1}X'y_i \tag{11-72}$$

Equation (11-72) is simply another way of writing the 2SLS estimator of Eq.
(11-69), as may be verified by substituting for $Z_i$, multiplying out, and comparing
with the original expression for the 2SLS estimator in Eq. (11-45).

We may note in passing that Eq. (11-72) affords a simple demonstration of
the equivalence of 2SLS and ILS in the case of a just identified equation. The
order condition for exact identification of the $i$th structural equation is

$$K - k_i = g_i - 1 \qquad \text{or} \qquad k_i + g_i - 1 = K$$

Thus $Z_i$ is of order $n \times K$ so that $X'Z_i$ is of order $K \times K$ and may be assumed to
be nonsingular. In this special case Eq. (11-72) gives

$$d_i = (X'Z_i)^{-1}(X'X)(Z_i'X)^{-1}Z_i'X(X'X)^{-1}X'y_i = (X'Z_i)^{-1}X'y_i$$

which, from Eq. (11-42), is seen to be the ILS estimator for the $i$th structural
equation.
We also know from the discussion of GLS estimators in Chap. 8 that it is
possible to interpret the GLS estimator as equivalent to the estimator given by the
application of OLS to suitably transformed data. The present case may be so
interpreted, and this leads to a considerable simplification in the presentation of
the 3SLS estimator. Consider again Eq. (11-70) whose disturbance has a variance
matrix given by $\sigma_{ii}X'X$. Since $X'X$ is positive definite, we know from Chap. 4 that
its inverse is also positive definite and that a nonsingular matrix $P$ exists such that

$$(X'X)^{-1} = PP' \tag{11-73}$$

from which it follows that

$$P'X'XP = I \tag{11-74}$$

Premultiplying Eq. (11-70) by $P'$ gives

$$P'X'y_i = P'X'Z_i\delta_i + P'X'u_i$$

or

$$w_i = W_i\delta_i + v_i \tag{11-75}$$

where

$$w_i = P'X'y_i \qquad W_i = P'X'Z_i \qquad v_i = P'X'u_i$$

The variance matrix for the disturbance term in Eq. (11-75) is

$$E(v_iv_i') = E(P'X'u_iu_i'XP) = \sigma_{ii}P'X'XP = \sigma_{ii}I \tag{11-76}$$

The application of OLS to Eq. (11-75) then gives

$$d_i = (W_i'W_i)^{-1}W_i'w_i \tag{11-77}$$

which is easily seen to reduce to the 2SLS estimator in Eq. (11-72).
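The transformation argument can be verified directly: a Cholesky factor of $(X'X)^{-1}$ serves as $P$, and OLS on the transformed data reproduces the estimator of Eq. (11-72). The simulated data, seed, and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 250
X = rng.normal(size=(n, 4))                       # all predetermined variables
u = rng.normal(size=n)
y2 = X @ np.ones(4) + 0.6 * u + rng.normal(size=n)
Z = np.column_stack([y2, X[:, :2]])               # Z_i = [Y_i  X_i]
y = Z @ np.array([0.5, 1.0, -1.0]) + u

XtX = X.T @ X
# Lower-triangular P with P P' = (X'X)^{-1}, Eq. (11-73)
P = np.linalg.cholesky(np.linalg.inv(XtX))

# Transformed data of Eq. (11-75)
w = P.T @ X.T @ y
Wm = P.T @ X.T @ Z

# OLS on the transformed equation, Eq. (11-77)
d_ols = np.linalg.lstsq(Wm, w, rcond=None)[0]

# GLS / 2SLS form, Eq. (11-72)
A = Z.T @ X @ np.linalg.solve(XtX, X.T @ Z)
r = Z.T @ X @ np.linalg.solve(XtX, X.T @ y)
d_2sls = np.linalg.solve(A, r)
```

Any $P$ satisfying Eq. (11-73) would do; the Cholesky factor is simply a convenient choice, and the transformed regression has only $K$ "observations" regardless of $n$.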
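Collecting all $G$ structural equations gives

$$\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_G \end{bmatrix} = \begin{bmatrix} W_1 & 0 & \cdots & 0 \\ 0 & W_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & W_G \end{bmatrix}\begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_G \end{bmatrix} + \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_G \end{bmatrix} \tag{11-78}$$

or, more compactly,

$$w = W\delta + v \tag{11-79}$$

where the definition of the symbols in Eq. (11-79) is obvious from the comparison
with Eq. (11-78). The variance matrix for the $v$ vector is

$$V = E(vv') = \begin{bmatrix} \sigma_{11}I & \sigma_{12}I & \cdots & \sigma_{1G}I \\ \sigma_{21}I & \sigma_{22}I & \cdots & \sigma_{2G}I \\ \vdots & \vdots & & \vdots \\ \sigma_{G1}I & \sigma_{G2}I & \cdots & \sigma_{GG}I \end{bmatrix} = \Sigma \otimes I \tag{11-80}$$

The variance terms in Eq. (11-80) follow directly from Eq. (11-76). The typical
covariance term is

$$E(v_iv_j') = E(P'X'u_iu_j'XP) = \sigma_{ij}I$$

Thus the basic assumption is that each structural equation has a homoscedastic
nonautocorrelated error term and that the disturbances in different structural
equations may be contemporaneously correlated. Provided that at least some $\sigma_{ij}$
are nonzero, the arguments underlying the Zellner SURE estimator, already
considered in Chap. 8, would suggest that any of the $G$ equations defined by Eq.
(11-75) would be more efficiently estimated as a member of the complete set
defined in Eqs. (11-78) and (11-79). 3SLS is, in fact, simply the SURE estimator
of $\delta$ in Eq. (11-79). The only difficulty is that the $\Sigma$ matrix in Eq. (11-80) is
unknown. The Zellner-Theil suggestion is to estimate first each structural equation
by 2SLS, giving the residual vectors

$$\hat{u}_i = y_i - Z_id_i \qquad i = 1, \ldots, G$$

where $d_i$ is the 2SLS estimator of $\delta_i$. The elements of $\Sigma$ are then estimated by

$$s_{ij} = \frac{\hat{u}_i'\hat{u}_j}{n} \qquad \text{for all } i, j$$

giving

$$\hat{V} = \hat{\Sigma} \otimes I$$

The 3SLS estimator of $\delta$ is then

$$d_{3SLS} = (W'\hat{V}^{-1}W)^{-1}W'\hat{V}^{-1}w \tag{11-81}$$

with asymptotic variance matrix estimated by

$$\text{asy var}(d_{3SLS}) = (W'\hat{V}^{-1}W)^{-1}$$

Substituting in Eq. (11-81) for the elements of $w$ and $W$, the 3SLS estimator may
be expressed in terms of the original data as

$$d_{3SLS} = \begin{bmatrix} s^{11}Z_1'X(X'X)^{-1}X'Z_1 & s^{12}Z_1'X(X'X)^{-1}X'Z_2 & \cdots & s^{1G}Z_1'X(X'X)^{-1}X'Z_G \\ s^{21}Z_2'X(X'X)^{-1}X'Z_1 & s^{22}Z_2'X(X'X)^{-1}X'Z_2 & \cdots & s^{2G}Z_2'X(X'X)^{-1}X'Z_G \\ \vdots & \vdots & & \vdots \\ s^{G1}Z_G'X(X'X)^{-1}X'Z_1 & s^{G2}Z_G'X(X'X)^{-1}X'Z_2 & \cdots & s^{GG}Z_G'X(X'X)^{-1}X'Z_G \end{bmatrix}^{-1}\begin{bmatrix} \sum_{j=1}^{G} s^{1j}Z_1'X(X'X)^{-1}X'y_j \\ \sum_{j=1}^{G} s^{2j}Z_2'X(X'X)^{-1}X'y_j \\ \vdots \\ \sum_{j=1}^{G} s^{Gj}Z_G'X(X'X)^{-1}X'y_j \end{bmatrix} \tag{11-82}$$

where the $s^{ij}$ denote the elements in $\hat{\Sigma}^{-1}$.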

A crucial question concerns the conditions under which 3SLS will be asymp- 
totically more efficient than 2SLS. A necessary condition for the superior efficiency 
of a full-information, or complete-system, method of estimation over a limited- 
information method is that the specification of the complete model should be 
correct. In many systems this is a formidable requirement, and the larger and 
more detailed the system, the more difficult does it become. Even granted a 
correct full-system specification, there are two conditions under which 2SLS and
3SLS will give identical point estimates with identical asymptotic sampling
variances. The first is

$$\sigma_{ij} = 0 \qquad \text{for all } i \neq j$$

that is, the contemporaneous correlations between the disturbances in different
structural equations are all zero. The equivalence follows directly from the result
for the SURE model that a diagonal $\Sigma$ matrix gives equality between the SURE
and OLS coefficients.† It may also be seen directly by substituting $s^{ij} = 0$ in Eq.
(11-82).‡

† See Problem 8-2.

‡ Notice that Eq. (11-82) refers to the feasible 3SLS estimator where $\Sigma^{-1}$ has been replaced by
$\hat{\Sigma}^{-1}$. If $\Sigma$ is diagonal, then so is $\Sigma^{-1}$ and $\sigma^{ij} = 0$ for all $i \neq j$. However, even if this condition is
satisfied, the $s_{ij}$ (and hence the $s^{ij}$) estimated from 2SLS residuals will in general not vanish.

The other condition under which one would find equivalence of 2SLS
and 3SLS estimators is all equations being exactly identified. We have already
seen that the order condition for exact identification of the ith equation leads to
the result that $X'Z_i$ is of order $K \times K$ and may be assumed to be nonsingular. The
$P$ matrix defined in Eq. (11-73) is also square of order $K$ and nonsingular. Thus

$$W_i = P'X'Z_i$$

is $K \times K$ and nonsingular. This result holds for all $i = 1, \ldots, G$. Thus the
block-diagonal $W$ matrix defined in Eqs. (11-78) and (11-79) is nonsingular and
each component submatrix is nonsingular. The 3SLS estimator defined in Eq.
(11-81) may then be written

$$d_{3SLS} = W^{-1}\hat{V}(W')^{-1}W'\hat{V}^{-1}w = W^{-1}w$$

Under the same assumption the 2SLS estimator for the ith equation defined in
Eq. (11-77) reduces to

$$d_i = W_i^{-1}w_i$$

Thus the collection of 2SLS estimators for the complete system may be written

$$d_{2SLS} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_G \end{bmatrix} = \begin{bmatrix} W_1^{-1} & 0 & \cdots & 0 \\ 0 & W_2^{-1} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & W_G^{-1} \end{bmatrix} w = W^{-1}w$$

which is identical with $d_{3SLS}$.
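
The feasible 3SLS recipe (2SLS residuals, an estimate of $\Sigma$, then the stacked estimator written in terms of the original data) and the equivalence with 2SLS under exact identification can be sketched as follows. This is only an illustration: the two-equation systems, their coefficients, and the helper names `two_sls`, `three_sls`, and `simulate` are all hypothetical, and the residual cross products are divided by $n$ as an assumed convention.

```python
import numpy as np

def two_sls(y, Z, X):
    """2SLS for one equation: [Z'X(X'X)^{-1}X'Z]^{-1} Z'X(X'X)^{-1}X'y."""
    Z_hat = X @ np.linalg.solve(X.T @ X, X.T @ Z)     # X(X'X)^{-1}X'Z
    return np.linalg.solve(Z_hat.T @ Z, Z_hat.T @ y)

def three_sls(ys, Zs, X):
    """Feasible 3SLS in original-data form; returns (d_3sls, stacked d_2sls)."""
    n, G = X.shape[0], len(ys)
    d2 = [two_sls(y, Z, X) for y, Z in zip(ys, Zs)]
    resid = np.column_stack([y - Z @ d for y, Z, d in zip(ys, Zs, d2)])
    S_inv = np.linalg.inv(resid.T @ resid / n)        # elements s^{ij}
    Z_hat = [X @ np.linalg.solve(X.T @ X, X.T @ Z) for Z in Zs]
    lhs = np.block([[S_inv[i, j] * (Z_hat[i].T @ Zs[j]) for j in range(G)]
                    for i in range(G)])
    rhs = np.concatenate([sum(S_inv[i, j] * (Z_hat[i].T @ ys[j]) for j in range(G))
                          for i in range(G)])
    return np.linalg.solve(lhs, rhs), np.concatenate(d2)

def simulate(X, C, rng):
    """Draw Y from B y_t = C x_t + u_t with correlated disturbances (made-up values)."""
    B = np.array([[1.0, -0.5], [0.4, 1.0]])
    Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
    U = rng.multivariate_normal(np.zeros(2), Sigma, size=X.shape[0])
    return (X @ C.T + U) @ np.linalg.inv(B).T

rng = np.random.default_rng(1)
n = 500

# Case 1: K = 4 and three right-hand variables per equation (overidentified).
X4 = rng.normal(size=(n, 4))
Y = simulate(X4, np.array([[1.0, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.5]]), rng)
Zs = [np.column_stack([Y[:, 1], X4[:, :2]]), np.column_stack([Y[:, 0], X4[:, 2:]])]
d3_over, d2_over = three_sls([Y[:, 0], Y[:, 1]], Zs, X4)

# Case 2: K = 2 and two right-hand variables per equation (exactly identified).
X2 = rng.normal(size=(n, 2))
Y = simulate(X2, np.eye(2), rng)
Zs = [np.column_stack([Y[:, 1], X2[:, 0]]), np.column_stack([Y[:, 0], X2[:, 1]])]
d3_exact, d2_exact = three_sls([Y[:, 0], Y[:, 1]], Zs, X2)

print("overidentified:    ", d3_over, d2_over)
print("exactly identified:", d3_exact, d2_exact)
```

In the exactly identified case the two vectors should coincide up to rounding, as the algebra above implies; in the overidentified case they will generally differ.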

So far we have assumed that all the structural equations in the model are 
identified. Before attempting to apply 3SLS in practice one must omit all 
unidentified equations and also all identities, since the latter have zero disturbances
which would render the $\Sigma$ matrix singular. Suppose that there remain $G$
identified equations of which $G_1$ are exactly identified and $G_2$ overidentified.
Zellner and Theil have shown that the 3SLS estimator of the $G_2$ equations,
treated as a complete group, is the same as that obtained from the application of
3SLS to the complete system of $G$ equations. Thus it is computationally efficient
to obtain the 3SLS estimates in two steps. First compute the 3SLS estimates of
the overidentified equations. The 3SLS estimates of the just identified equations
are then obtained by adding to the relevant 2SLS estimates a linear combination
of the 3SLS estimates of the overidentified equations.†


Full-Information Maximum Likelihood 


As with 3SLS this is a complete system method of estimation. It is computa- 
tionally more expensive than 3SLS as it involves the solution of nonlinear 
equations. We will merely sketch the outlines of the approach. Consider again the 
linear simultaneous equation model in G current endogenous variables 


$$By_t + \Gamma x_t = u_t \qquad t = 1, \ldots, n$$

with

$$E(u_t) = 0 \qquad E(u_t u_t') = \Sigma \qquad t = 1, \ldots, n$$

† The precise formula is given in A. Zellner and H. Theil, “Three Stage Least Squares: Simultaneous
Estimation of Simultaneous Equations,” Econometrica, vol. 30, 1962, p. 67.


If it is assumed that the G disturbances follow a multivariate normal distribution, 
we may write 


$$p(u_t) = (2\pi)^{-G/2}(\det\Sigma)^{-1/2}\exp\left(-\tfrac{1}{2}u_t'\Sigma^{-1}u_t\right)$$


Assuming, in addition, that the u vectors are serially uncorrelated, the likelihood 
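for the $n$ vectors $u_1, u_2, \ldots, u_n$ is then

$$p(u_1, u_2, \ldots, u_n) = \prod_{t=1}^{n} p(u_t) = (2\pi)^{-nG/2}(\det\Sigma)^{-n/2}\exp\left(-\tfrac{1}{2}\sum_{t=1}^{n}u_t'\Sigma^{-1}u_t\right)$$

The likelihood for $y_1, y_2, \ldots, y_n$ is

$$p(y_1, y_2, \ldots, y_n) = (2\pi)^{-nG/2}\,|\det B|^{n}\,(\det\Sigma)^{-n/2}\exp\left[-\tfrac{1}{2}\sum_{t=1}^{n}(By_t + \Gamma x_t)'\Sigma^{-1}(By_t + \Gamma x_t)\right] \tag{11-83}$$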


Table 11-1 Estimation methods in the models of Project Link

                                            Total       Number of
                                            number of   stochastic   Estimation
Country                             Data†   equations   equations    method
Australia                           Q       82          42           OLS
Austria                             A       128         54           OLS
Belgium                             Q       25          19           OLS
Canada                              A       183         “4           OLS
Finland                             Q       144         60           OLS
France                              A       32          19           OLS
West Germany                        A       137         51           FIML
Italy                               Q       104         53           OLS
Japan                               Q       78          43           OLS
Netherlands                         A       87          13           LIML and 2SLS
Sweden                              A       133         15           OLS
United Kingdom                      Q       226         106          OLS
United States                       Q       207         ui)          OLS
Developing America                  A       12          11           OLS
Developing South and East Asia      A       14          13           OLS
Developing Middle East and Libya    A       10          9            OLS
Developing Africa less Libya        A       in)         10           OLS

† Q = quarterly data; A = annual data.
Source: J. Waelbroeck, The Models of Project Link, North-Holland, Amsterdam, 1976.


If we write

$$By_t + \Gamma x_t = [B \quad \Gamma]\begin{bmatrix} y_t \\ x_t \end{bmatrix} = Az_t$$

the exponent in the likelihood in Eq. (11-83) can be written

$$-\tfrac{1}{2}\sum_{t=1}^{n} z_t'A'\Sigma^{-1}Az_t = -\tfrac{1}{2}\,\text{tr}(ZA'\Sigma^{-1}AZ') = -\tfrac{1}{2}\,\text{tr}(\Sigma^{-1}AZ'ZA')$$

where

$$Z = [Y \quad X] = \begin{bmatrix} y_1' & x_1' \\ y_2' & x_2' \\ \vdots & \vdots \\ y_n' & x_n' \end{bmatrix}$$

is the $n \times (G + K)$ matrix of observations on all the endogenous and predetermined
variables. Defining

$$M = \frac{1}{n}Z'Z$$

we have

$$\text{tr}(\Sigma^{-1}AZ'ZA') = n\,\text{tr}(\Sigma^{-1}AMA')$$


and so the logarithm of the likelihood in Eq. (11-83) may be written

$$L(A, \Sigma) = \text{constant} + n\ln|\det B| - \frac{n}{2}\ln\det\Sigma - \frac{n}{2}\,\text{tr}(\Sigma^{-1}AMA') \tag{11-84}$$
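
As a check on the algebra, the compact log-likelihood of Eq. (11-84) can be compared numerically with the direct sum of log densities implied by Eq. (11-83). The sketch below is illustrative only: the parameter values and data are made up, the function names are hypothetical, and the constant is written out as $-(nG/2)\ln 2\pi$.

```python
import numpy as np

def loglik_compact(B, Gamma, Sigma, Y, X):
    """Eq. (11-84): const + n ln|det B| - (n/2) ln det Sigma - (n/2) tr(Sigma^{-1} A M A')."""
    n, G = Y.shape
    A = np.hstack([B, Gamma])              # A = [B  Gamma], G x (G + K)
    Z = np.hstack([Y, X])                  # Z = [Y  X],     n x (G + K)
    M = Z.T @ Z / n
    const = -0.5 * n * G * np.log(2.0 * np.pi)
    logabsdet_B = np.linalg.slogdet(B)[1]
    logdet_Sigma = np.linalg.slogdet(Sigma)[1]
    trace_term = np.trace(np.linalg.solve(Sigma, A @ M @ A.T))
    return const + n * logabsdet_B - 0.5 * n * logdet_Sigma - 0.5 * n * trace_term

def loglik_direct(B, Gamma, Sigma, Y, X):
    """Log of Eq. (11-83), summing the quadratic form observation by observation."""
    n, G = Y.shape
    U = Y @ B.T + X @ Gamma.T              # row t is (B y_t + Gamma x_t)'
    quad = np.einsum("ti,ij,tj->", U, np.linalg.inv(Sigma), U)
    return (-0.5 * n * G * np.log(2.0 * np.pi) + n * np.linalg.slogdet(B)[1]
            - 0.5 * n * np.linalg.slogdet(Sigma)[1] - 0.5 * quad)

# Made-up two-equation system with three predetermined variables.
rng = np.random.default_rng(2)
n = 300
B = np.array([[1.0, -0.5], [0.4, 1.0]])
Gamma = np.array([[-1.0, 0.0, 0.0], [0.0, -1.0, -0.5]])
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
X = rng.normal(size=(n, 3))
U = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
Y = (U - X @ Gamma.T) @ np.linalg.inv(B).T   # solves B y_t + Gamma x_t = u_t

ll_compact = loglik_compact(B, Gamma, Sigma, Y, X)
ll_direct = loglik_direct(B, Gamma, Sigma, Y, X)
print(ll_compact, ll_direct)
```

A function of this kind could be passed to a general-purpose numerical optimizer to obtain FIML estimates, which is where the nonlinearity and computational cost arise.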


The FIML estimator results from the maximization of $L(A, \Sigma)$ with respect to the
elements of $A$ and $\Sigma$. The equations are nonlinear and computationally expensive,
though less so with each advance in computer technology. The asymptotic
variance matrix of the FIML estimator, however, turns out to be identical with
that for 3SLS, thus indicating the asymptotic efficiency of the latter method.†

This feature, combined with its less severe computational problems, leads
some authors to recommend 3SLS over FIML. Most practical applications of
3SLS or FIML occur, not surprisingly, with fairly small models. What is perhaps
surprising is the continued dominance of OLS over all other methods, especially
in the estimation of major econometric models. A recent study by Waelbroeck‡
documents the main features of the various countrywide econometric models in
Project Link. Of the 17 models summarized, OLS is the estimating method in 15,
FIML is used in only one model, and a combination of LIML and 2SLS in the
remaining model. Details are given in Table 11-1.


† See H. Theil, Principles of Econometrics, Wiley, New York, 1971, pp. 524-527.
‡ J. Waelbroeck, The Models of Project Link, North-Holland, Amsterdam, 1976.

PROBLEMS


11-1 For the model defined by Eqs. (11-1) and (11-2) show that

$$\text{plim}(b) = \frac{\beta m_{zz} + \sigma^2}{m_{zz} + \sigma^2}$$

where $b$ is the slope of the OLS regression of $C$ on $Y$ and

$$m_{zz} = \text{plim}\left[\frac{1}{n}\sum (z_t - \bar{z})^2\right]$$


11-2 Prove the equivalence between the alternative expressions for the 2SLS estimator in Eqs. (11-45) 
and (11-46). 
11-3 The structure of the Klein model is 


$$C = \alpha_0 + \alpha_1(W_1 + W_2) + \alpha_2\Pi + \alpha_3\Pi_{-1} + u_1$$

$$I = \beta_0 + \beta_1\Pi + \beta_2\Pi_{-1} + \beta_3K_{-1} + u_2$$

$$W_1 = \gamma_0 + \gamma_1(Y + T - W_2) + \gamma_2(Y + T - W_2)_{-1} + \gamma_3 t + u_3$$

$$Y = C + I + G$$

$$\Pi = Y - W_1 - T$$

$$K = K_{-1} + I$$


The six endogenous variables are $Y$ (output), $C$ (consumption), $I$ (net investment), $W_1$ (private wages),
$\Pi$ (profits), and $K$ (capital stock at year-end). The four exogenous variables are $G$ (government
nonwage expenditure), $W_2$ (public wages), $T$ (business taxes), and $t$ (time).

Examine the rank condition for the identifiability of the consumption function. 


11-4 Tintner’s model of the U.S. meat market is specified as follows:

$$y_1(t) = \alpha_0 + \alpha_1 y_2(t) + \alpha_2 x_1(t) + u_1(t) \qquad \text{demand}$$

$$y_1(t) = \beta_0 + \beta_1 y_2(t) + \beta_2 x_2(t) + \beta_3 x_3(t) + u_2(t) \qquad \text{supply}$$

(a) Determine the identification status of each equation.
(b) Suppose it is known a priori that $\beta_2/\beta_3 = k$ where $k$ is a known number. Determine the
identification status of each equation under this specification.
(c) Suppose the model stated at the outset is changed by specifying that $\alpha_2 = \beta_2 = \beta_3 = 0$.
What prior restrictions (if any) on the disturbance variance-covariance matrix would lead to the
identification of both equations?
(University of Michigan, 1981)
11-5 In the model 
$$y_{1t} + \beta_{12}y_{2t} + \gamma_{11}x_{1t} = u_{1t}$$

$$y_{2t} + \beta_{21}y_{1t} + \gamma_{22}x_{2t} + \gamma_{23}x_{3t} = u_{2t}$$

the $y$’s are endogenous, the $x$’s exogenous, and $u_t' = [u_{1t}, u_{2t}]$ is a vector of serially independent
normal random disturbances with mean zero vector and the same nonsingular covariance matrix for
each $t$. Given the following sample second moment matrix:


= 
CUNnaR 
o-nsoa 
cCOo-NN 
o-o-w& 
-oscco 


calculate the LIML and 2SLS estimates of $\beta_{12}$ and $\gamma_{11}$.

(University of Michigan, 1981) 
11-6 An investigator has specified the following two models and proposes to use them in some 
empirical work with macroeconomic time series data. 


Model 1:   $c_t = \alpha_1 y_t + \alpha_2 m_{t-1} + u_{1t}$
           $i_t = \beta_1 y_t + \beta_2 r_t + u_{2t}$
           $y_t = c_t + i_t$
Jointly dependent variables:   $c_t$, $i_t$, $y_t$
Predetermined variables:   $r_t$, $m_{t-1}$

Model 2:   $m_t = \gamma_1 r_t + \gamma_2 m_{t-1} + v_{1t}$
           $r_t = \delta_1 m_t + \delta_2 m_{t-1} + \delta_3 y_t + v_{2t}$
Jointly dependent variables:   $m_t$, $r_t$
Predetermined variables:   $m_{t-1}$, $y_t$

(a) Assess the identifiability of the parameters that appear as coefficients in the above two
models (treating the two models separately).

(b) Obtain the reduced-form equation for $y_t$ in model 1 and the reduced-form equation for $r_t$ in
model 2.

(c) Assess the identifiability of the two-equation model comprising the reduced-form equation
for $y_t$ in model 1 (an IS curve) and the reduced-form equation for $r_t$ in model 2 (an LM curve).

(Yale University, 1980) 

11-7 Suppose the following sample second moment matrix (based on 36 observations) has been 
obtained for the variables in the Tintner meat model of Problem 11-4: 


J2 x X2 xy 
0 1 0 ik 
10 saat = 0 
=i 1 0 0 
ai 0 1 0 
0 0 0 1 


(a) Estimate the parameters a and a, by 2SLS and test the hypothesis a, = 0 against the 
alternative a, * 0. 


(b) Repeat part (a) using IV estimates of a, and a, obtained with x, as an instrument for y2, 
and xj, as its own instrument. 


(Yale University, 1980)
11-8 (a) Assess the identification of the parameters of the following five-equation system:
Vie + Bidar + Bia dar + i210 + NisZae = Mie 
Var + Bos Vai + Bas sr + Yo2Z24 = Uae 
Yar + Yaris + YaaZar = May 
Bar Vir + Baz.¥31 + Yar + Yaa220 + YasZar = Yar 
2 ys + Ys1 — 220 = 0 


(b) How are your conclusions altered if y3; = 0? Comment. 
(c) Briefly explain how you would estimate the parameters of this model. What can be said 


about the parameters of the second equation? 


(UL, 1979) 
11-9 The model given by

$$y_{1t} = \beta_{12}y_{2t} + \gamma_{11}z_{1t} + \gamma_{12}z_{2t} + \varepsilon_{1t} \tag{1}$$

$$y_{2t} = \beta_{21}y_{1t} + \gamma_{23}z_{3t} + \varepsilon_{2t} \tag{2}$$

generates the following matrix of second moments:

          y_1     y_2     z_1     z_2     z_3
  y_1     3.5     3       1       1       0
  y_2             11.5    1       3       4
  z_1                     1       0       0
  z_2                             1       1
  z_3                                     2

Calculate:
(a) Least-squares estimates of the unrestricted reduced-form parameters
(b) ILS estimates of the parameters of Eq. (1)
(c) 2SLS estimates of the parameters of Eq. (2)
(d) The restricted reduced form derived from parts (b) and (c)
(e) A consistent estimate of $E(\varepsilon_{1t}\varepsilon_{2t}) = \sigma_{12}$
(UL, 1973)
11-10 Let the model be 
dae + Biadar + Mkt, + Y9%30 = Me 
Bay vie + Yar + Yo2Xar + Yo3%31 = Hr () 
and suppose the observations on the variables are 
1/2714 ooh he 4789 10 Il 
x=]4 4 8 10 12 20 walgats 2 21) Q) 
2) .0 Males meio) ct 


(a) Examine the rank and order conditions for identification on the basis of Eqs. (1) and
suggest a suitable estimation procedure for both equations.

(b) In the light of Eqs. (2) investigate whether the answer under part (a) needs modification
and, if so, in what way. Interpret by reformulating the model in Eqs. (1).


(UL, 1971) 


11-11 Let the model be 
Vie + Bidar + Ya¥20 + Y3%3r = Me 


Boris + Yar + Yarn + Y24%4e = H2e
If the second moment matrices of a sample of 100 observations are


. 80.0 -—4.0 a 200 O 23.0" 5:0 
WMS a0 vaso) ee Lacs Pe 05 —1.0 
OSE earty 
Oe 720 0) 0 
0 0 10 0 
0 0 0 0.5 
find the 2SLS estimates of the coefficients of the first equation and their standard errors. 
(UL, 1970) 
11-12 In the following market model 


Supply     $Q_t = \beta_{11}P_t + \gamma_{10} + u_{1t}$
Demand     $Q_t = \beta_{21}P_t + \gamma_{20} + \gamma_{21}Z_{1t} + \gamma_{22}Z_{2t} + u_{2t}$


quantity $Q_t$ and price $P_t$ are endogenous, while income $Z_{1t}$ and the price of some other good $Z_{2t}$ are
exogenous. If the supply function is estimated directly by least squares, will the resulting estimate of
$\beta_{11}$ be biased? If so, in which direction will the bias occur?


(UL, 1972) 
11-13 If 
Yur = Bryer Maku + YX. + Mie 
Yar = Bay Yin YasXar + Ua, 
0 0 0 10 20 
and XX=|0 5 O xXY=/20 10 
0 0 10 30 20 


estimate the parameters in the model and comment on your results. If 83, is known to be equal to 0.6, 
would you modify your estimation procedure, and if so how? 


11-14 The X’X matrix for all the exogenous variables in a model is 


Oat d os 
ves" || CUNY er Ft 
SP Romb hai teeaikweod 

T(t) 0 bball 


Only the first of these exogenous variables has a nonzero coefficient in a structural equation to be 


estimated by 2SLS. This equation includes two endogenous variables, and the least-squares estimates 
of the reduced-form coefficients for these two variables are 


e tips ei 
lo-bed = 
Taking the first endogenous variable as the dependent variable, state and solve the equation for the 


2SLS estimates. 
11-15 For the model 


Vie Birdae + Wide + My 


Jae = Bor Yue + YarXa0 + YaaXaet Uy 
you are given the following information: 


1, The least-squares estimates of the reduced-form coefficients are 


[ urn Cm 

10 10 5 

2. The estimates of variance of the errors of the coefficients in the first reduced-form equation are 1,
0.5, 0.1.

3. The corresponding covariances are estimated to be all zero.
4. The estimated variance of the error on the first reduced-form equation is 2.0.

Use this information to reconstruct the 2SLS equations for the estimates of the coefficients of the
first structural equation, and compute these estimates.
(UL, 1969)


11-16 
Vie = BiYar + Bisa + Yk + Me 


is one equation in a three-equation model which contains three other exogenous variables x2,, x3,, and 
Observations give the following matrices: 


20: HIS es Biel bhaes 5 
yy=| 15 6 —45] YX=|0 4 12 -5| XX= 
= kay Te 0 —-2 -12 10 


Xap 


cone 
conoo 
wucse 


coce 


Obtain 2SLS estimates of the parameters of the equation and estimate their standard errors (on the 


assumption that the sample consisted of 30 observation points). 
(UL, 1968) 


CHAPTER 


TWELVE 


ECONOMETRICS IN PRACTICE: 
PROBLEMS AND PERSPECTIVES 


A careful study of the material covered in the previous eleven chapters would not, 
unfortunately, equip the reader to conduct a successful piece of applied econo- 
metric research, since that involves many more problems than those already 
discussed. We will tentatively explore some of these issues in the present chapter, 
but the reality should be faced at the outset that it is not feasible to write a 
comprehensive manual that would prepare applied econometricians for all the 
problems that can arise in a wide variety of research projects. Successful econo- 
metric modeling is not a collection of mechanistic and routine procedures but 
more of an art requiring wide-ranging knowledge and judgment. Such an art is 
best learned by practice, hopefully with talented supervisors and colleagues, and 
by study of “best practice” examples. It is, however, not always easy to find the 
latter. Indeed a very instructive book might be written under the title, How NOT 
to Do Econometrics, with every chapter illustrated by one or more published 
articles. The author of such a book would have to time its publication carefully in 
relation to his own impending demise or retirement from contact with his 
professional colleagues: he might also face the difficult problem of choosing some 
of his own previous work for inclusion. 

There is a widespread view that econometrics has in some sense not lived up 
to its early promise, and there is much scepticism about the value of the plethora 
of empirical results embedded in the literature. This state of affairs should not be 
too surprising. There is, after all, a sound proposition in economics that the use of 
a good or service tends to expand to the point at which price and marginal utility
are equated. If computers are essentially free goods, if “researchers” can plug into
a data bank without any understanding of where the series come from or how 
they were constructed, if they can press buttons to implement computer programs 
whose contents they dimly comprehend, it then follows, as night follows the day, 
that some work of zero worth will emerge. Indeed, given the uncertainty inherent 
in the research process, compounded by the fallibility of the researcher, some 
outputs may err on the wrong side of zero and be positively dangerous. Our twin 
defenses against further encroachments by a flow of dubious work lie in improv- 
ing still further the quality of the editorial screening process and raising also the 
quality of the training given to would-be practitioners. 


The Origins and Objectives of an Econometric Research Project 


The origins and objectives of econometric research projects are as many and 
various as the persons and groups who undertake them. It may be a lone graduate 
student scratching his head or rummaging in his supervisor's “bottom drawer” 
for a thesis project; it may be an academic intrigued by a theoretical debate in the 
literature or impelled by some idea of her own; it may be a public or commercial 
research group building or expanding an econometric model to be used for 
short-term forecasting. In my own experience the writing of this book was 
delayed for several years by two separate phone calls. One led to a year’s work 
estimating demand functions for oil for the major industrial countries of the 
world. A subsequent, but not unrelated, call led to two years’ intense activity as 
the econometric consultant on the construction of a world model of energy 
demands and supplies. It is platitudinous but important to say that in all cases 
one should be as clear and precise as possible about the objectives of the research, 
since these condition the design and layout of the project, though of course they 
may have to be revised as the project proceeds. 


Data and Model Specification 

These two topics are inextricably linked. The model specification will have strong 
implications for the data required and, conversely, data limitations may constrain 
the feasible specification. As an illustration suppose an objective is the estimation 
of a demand function for crude oil in the United Kingdom that might then be 
used to forecast demand, conditional on various assumptions about the future 
paths of income and relative prices. The first step is to investigate the range of 
possible strategies. Should one estimate an aggregative demand function for 
crude, using some measure of “income” and some relative price? If so, should the
income measure be real GDP, or an index of industrial production, or what?
What, in turn, are the appropriate price series from which an index of relative
prices should be constructed? Or, alternatively, should one use a more disaggregated
approach looking at the final demands for specific products of the
refining process, such as gasoline, jet fuel, heating oil, and so on, and should one
disaggregate also by consuming sector, whether residential, commercial, industrial,
public utilities, and so forth? The disaggregated approach would also
require the modeling of the refining decision and of the relationship between the
price of crude and the prices of refined products. In this decision process it is of 
great importance to have as much knowledge as possible of what may be called 
the “institutional realities” of the situation, specifically in this case such things as 
the nature of the refining process and the constraints on the refining decision, the 
quantitative importance of various groups of consumers, and the crucial factors in 
their decision processes. An econometrician coming cold to the study would run 
the risk of very slow progress with much searching through inappropriate formu- 
lations. In my own experience collaboration with an experienced oil specialist 
greatly improved the research efficiency. 

Knowledge of the “institutional realities” is, of course, valuable in all areas. 
In a study of cost-output relationships in coal mining this author felt it necessary 
to don a safety helmet and get to the coal face in the narrow and twisting seams 
of the Lancashire coal field in order to see at first hand the nature of the 
production process before sitting down to peruse the statistics at the regional 
headquarters of the National Coal Board. Similarly in studies of scale, costs, and 
profitability in road passenger transport and of cost-output variations in a 
multiple-product firm the author spent time at each firm talking to accountants 
and managers to study their accounting and decision processes before extracting 
the relevant data by hand from the firm’s records.† To take a final data problem,
monetary theory postulates the demand for money to be positively related to 
income and negatively related to the rate of interest. Each of the three nouns in 
this proposition raises formidable problems of definition and measurement. There 
are numerous definitions of money and almost continual evolution of payments 
technology, there are many interest rates, and even income is not unambiguous. 

When appropriate data series have been identified, the next decision in 
time-series contexts is what data period (hourly, weekly, monthly, quarterly, 
annual, or whatever) to use. Again if we had institutional information about 
decision processes (who decides when about what) we could make the appropriate 
choice. If, for example, production decisions are revised at the start of each 
month, a model of the production decision employing monthly data would have 
the best chance of capturing the essential features of the process. Quarterly or 
annual data would in this case involve an inappropriate aggregation over time, 
thus making it difficult, if not impossible, to determine the lag structure. Often, 
however, there is little firm information about decision procedures, and the main 
choice between quarterly and annual data is based largely on a mixture of 
empirical considerations and the objectives of the modeling process. As Table 
11-1 shows, the macroeconometric models for the developed economies are split 
roughly evenly between those based on quarterly and those based on annual data. 

† For these and other studies see J. Johnston, Statistical Cost Analysis, McGraw-Hill, New York,
1960.

By far the most difficult problem of all is the initial specification of the
model, be it a single equation or a set of equations. By specification we mean the
following:


1. The listing of explanatory variables, including lagged values, in each equation 
2. The functional form relating these variables to the dependent variable 
3. The stochastic properties of the disturbance term or terms 


Economic theory is mostly about equilibrium situations and contains little in 
the way of systematically developed dynamic theory. Thus it cannot be expected 
to yield strong insights about lag structure. Nor can it be expected to indicate the 
correct functional form. Thus items 1 and 2 inevitably lead to a certain amount of
interaction between theory and data. This interaction also impinges on item 3,
which essentially consists of assumptions about unobservable variables. However,
each specification under items 1 and 2 provides estimates of the unobservables,
and the interaction between specification and data usually continues until the 
researcher feels that a “reasonable” set of results under items 1, 2, and 3 has been 
obtained. 


Data Mining and Specification Searches 
This interactive process has been labeled data mining or, more recently and less
pejoratively, specification searches.† At one extreme it is alleged that data mining
invalidates all the conventional significance levels or, even more strongly, that the
final results are quite valueless, since the researcher has gone on a “fishing
expedition” or beaten the data set into submission until they finally yielded the
desired conclusion. At the other extreme it is suggested that if the set of models
includes the “true” model, that model will have the smallest residual variance and
hence the highest true $R^2$, so that searching for the best fit to the sample data is a
reasonable and sensible procedure.

Let us take a look at the data mining problem by means of two hypothetical
examples.

† This brief discussion cannot do adequate justice to this topic. The interested reader will
find much nourishment in the elegant, entertaining, and enlightening E. E. Leamer, Specification
Searches, Wiley, New York, 1978.


Example 12-1. A researcher's objective is to explain the variation in a variable $y$. He
has 10 candidate explanatory variables $x_1, \ldots, x_{10}$ on the basis of his a priori
theory. The underlying theory can only be characterized as “lousy” for

1. It specifies that only three of the possible 10 variables actually influence $y$,
   but it does not know which three, and, more seriously,
2. The theory is totally in error for, in fact, none of the 10 variables has any
   effect on $y$.

The first defect of the theory actually appears as an advantage to our
researcher for his computer cannot handle more than three explanatory variables
at a time. Thus he computes all $^{10}C_3 = 120$ possible multiple regressions and the
attendant F statistics for the overall fit. The true value of all 120 population F
statistics is of course zero, but the reader would not be surprised to find that our
researcher discovers some significant sample regressions.‡ His theoretical and
institutional knowledge enables him to write a plausible commentary on these 
regressions and perhaps select one as the seemingly best theory for the explana- 
tion of y. Sending the write-up to an editor, who likes to publish “significant” 
results, guarantees another “scientific” paper and a further small step by the 
author up the academic ladder. 
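
The researcher's search is easy to mimic. The following sketch is purely illustrative and rests on assumptions not in the text: a sample size of 30, standard normal data generated independently of $y$, and the tabulated 5 percent point of $F(3, 26)$, taken here as 2.98. It fits all 120 three-variable regressions and counts how many overall F statistics exceed the critical value, and it also reproduces the back-of-envelope probabilities obtained when the 120 tests are treated as independent.

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)
n, n_candidates, k = 30, 10, 3        # assumed sample size; 10 candidates, 3 at a time
F_CRIT_5PCT = 2.98                    # tabulated 5 percent point of F(3, 26)

y = rng.normal(size=n)                        # y is unrelated to every candidate
X = rng.normal(size=(n, n_candidates))        # 10 irrelevant explanatory variables

def overall_F(y, X_sub):
    """F statistic for the joint significance of all slopes in a regression with intercept."""
    n_obs, k_vars = X_sub.shape
    D = np.column_stack([np.ones(n_obs), X_sub])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ beta
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / k_vars) / (rss / (n_obs - k_vars - 1))

triples = list(itertools.combinations(range(n_candidates), k))
F_stats = {t: overall_F(y, X[:, list(t)]) for t in triples}
significant = [t for t, F in F_stats.items() if F > F_CRIT_5PCT]
print(len(triples), "regressions,", len(significant), "significant at the 5 percent level")

# Chance of at least one "significant" result if the 120 tests were independent.
p_any_5pct = 1 - 0.95 ** 120
p_any_1pct = 1 - 0.99 ** 120
print(round(p_any_5pct, 3), round(p_any_1pct, 2))
```

Because the 120 regressions share explanatory variables, the count of significant regressions in any one simulated sample will not follow the independent-tests calculation exactly; rerunning with different seeds gives a feel for the variation.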

Another variant of Example 12-1 is a theory that only identifies the three 
candidate variables, none of which, in fact, has any relevance to y. A series of 
investigators drawing different sets of sample data from $y$, $x_1$, $x_2$, $x_3$ fail to find a
significant regression, consigning their computer printout to the waste paper 
basket or filing cabinet, according to temperament. In either case their profes- 
sional colleagues are unaware of this accumulation of “negative” results, and so 
testing of the theory continues. Working at any conventional level of significance, 
it is only a matter of time until a set of sample data is drawn that yields a 
“significant” result, which will, of course, have a good chance of being published. 
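
The "matter of time" can be quantified with a small calculation. The sketch below assumes, as a simplification, that the successive investigators draw independent samples and each test at level alpha when the null hypothesis is true; the function names are hypothetical.

```python
import math

def prob_at_least_one(alpha, m):
    """Probability that at least one of m independent tests rejects a true null."""
    return 1.0 - (1.0 - alpha) ** m

def studies_needed(alpha, target):
    """Smallest number of independent studies giving at least `target` probability
    that one or more of them reports a spurious significant result."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - alpha))

m_half = studies_needed(0.05, 0.5)
print(m_half, prob_at_least_one(0.05, m_half))
```

At the 5 percent level, 14 independent attempts are enough to make a spurious "significant" finding more likely than not.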


The moral of Example 12-1 is clear. In an area where theory is poor and 
provides little guidance on specification to the researcher, data mining is a highly 
dangerous activity. Combined with the propensity of editors to publish only 
significant results, it can in extreme cases result in the publication of falsehoods 
and the suppression of truth.§ However, take heart, faint reader, the above surely
cannot be a description of economics, the queen of the social sciences, richly
endowed with well articulated theory. Consider then Example 12-2.


‡ The 120 models may be represented by

$$y = \text{constant} + \beta_i x_i + \beta_j x_j + \beta_k x_k + u \qquad \text{for all } i, j, k;\ i \neq j \neq k$$

In each case the null hypothesis is

$$H_0\colon \beta_i = \beta_j = \beta_k = 0$$

Working at the 5 percent level of significance, the probability of accepting the null hypothesis for any
specific model is 0.95. Assuming independence of the models, the probability of accepting the null
hypothesis for all the models considered is $(0.95)^{120} = 0.0021$. Thus the chance that the researcher
finds at least one “significant” regression is 0.998. Working at the more stringent 1 percent level of
significance, the probability of finding at least one significant regression is still as high as 0.70. The
models will not all be independent of each other because of overlapping explanatory variables, so
these startling probabilities need not be taken too seriously, but they do indicate the nature of the
potential problem associated with data mining.

§ A small but constructive step toward addressing this problem was taken a few years ago by the
editors of the Journal of Political Economy, who initiated a section for the publication of “confirmations
and contradictions.”


Example 12-2. The minister of petroleum in the mythical oil-rich country of Sandia
desires to know the demand function for crude oil so that he may better inject
some good sense and realism into the next round of cartel discussions. Having
ample resources at his disposal, he commissions four separate econometricians to
estimate and deliver such a demand function. All four have access to the same
data base, namely, all the statistics ever published in the world plus the internal
files of the Sandia ministry of petroleum, but they are to work completely
independently.

Econometrician A is an able econometrician, trained in a good graduate 
school and already prewarned of the sin of data mining. After much cogitation, 
research, and study of the institutional realities, he specifies his demand function, 
estimates it by the appropriate procedures, and sends off the results to Sandia. 
Econometrician B is a better econometrician, who went to a better graduate 
school than A. He too is not about to engage in data mining, so he formulates his 
specification, which happens to differ somewhat from that of A, estimates the 
equation, and dispatches the results. Econometrician C is a clever chap from 
Cambridge (either one). He follows the same procedure as A and B, but as one 
might expect, his a priori specification is, in truth, superior to theirs. Econometri- 
cian D is a data miner from Dublin, who unhappily never had the good fortune to 
go to graduate school nor even to attend a lecture on statistics, but nonetheless 
has a certain degree of native intelligence. As befits an Irishman, he has been
warned about so many sins that he has completely forgotten the sin of data 
mining. His first attempt at the problem just happens to be the specification used 
by A. However, D does not much like the results and respecifies, just happening 
now to arrive at the specification used by B. The results of that are still not quite 
to his pleasing, so he respecifies once more and now happens to hit on the 
specification used by C. The results of that please him and are sent off to Sandia, 
but he does not confuse the minister by including the results of his earlier and, to 
him, unsatisfactory specifications.

Suppose we are privileged to have one further piece of information, which is 
that the C/D specification is the true and correct one. The statistical purist would 
presumably congratulate C and criticize D. However, their standard errors, 
confidence intervals, and associated F statistics are identical. Classical inference 
establishes the properties of estimators and tests of hypotheses by examining what 
might be expected to happen in repeated sampling from a given population or 
model. If the model has not been correctly specified, the tests are strictly invalid 
and the various probability statements are not correct. In the present hypothetical 
example inferences based on the A or B specifications would, strictly speaking, 
not be correct, while those based on the C/D specification are. Data mining has 
only enabled D to make good the defects in his education and has been beneficial 
rather than damaging. Finally we may observe that most classical procedures are
fairly robust to specification errors, and in practice, probably no finite model will
ever be the “true and correct” model so we should not be slavish devotees of
spuriously precise significance levels.


† The development of econometric theory has been heavily influenced by the
Cowles Commission, which emphasized problems of equation error to the almost total exclusion of
problems of measurement error. Little is known about significance levels or the relative properties of
different estimators when these problems jointly coexist, as indeed they do in practice.
We conclude that the circumstances of Example 12-2 are closer to those of
real economic research than those of Example 12-1, that interaction between 
theory and data is both inevitable and, indeed, desirable, and we turn now to a 
discussion of some specific guides to that respecification. 


Criteria for Model Selection 


Residual variance ($\bar{R}^2$) criterion. Most of the operational criteria have been
developed in the context of a single equation model. The first is the residual
variance, or $\bar{R}^2$, criterion. Suppose there are just two competing models for the
explanation of $y$, namely,

$$y = X_1\beta_1 + u_1 \qquad\text{and}\qquad y = X_2\beta_2 + u_2$$

where $X_i$ is nonstochastic, of order $n \times k_i$, and of full column rank. Suppose that,
in fact, the first model is correct. If the second model is fitted, the vector of OLS
residuals is

$$e_2 = M_2 y = M_2(X_1\beta_1 + u_1) \qquad\text{where}\qquad M_2 = I - X_2(X_2'X_2)^{-1}X_2'$$

Thus the residual sum of squares is

$$e_2'e_2 = \beta_1'X_1'M_2X_1\beta_1 + 2\beta_1'X_1'M_2u_1 + u_1'M_2u_1$$

Taking expectations

$$E(e_2'e_2) = \beta_1'X_1'M_2X_1\beta_1 + (n - k_2)\sigma^2 \tag{12-1}$$

Since $M_2$ is idempotent, the quadratic form on the right-hand side of Eq. (12-1) is
positive semidefinite. Defining $s_2^2 = e_2'e_2/(n - k_2)$, it then follows that

$$E(s_2^2) \ge \sigma^2$$

If the first (and correct) model is fitted, we know from Sec. 5-3 that

$$E(s_1^2) = \sigma^2$$

where

$$(n - k_1)s_1^2 = e_1'e_1 = y'y - y'X_1(X_1'X_1)^{-1}X_1'y$$

Thus†

$$E(s_1^2) \le E(s_2^2) \tag{12-2}$$


Notice that the inequality is in terms of the expected values of the residual
variances. In practice we can only compare the estimated residual variances. It is,
of course, possible for $s_2^2$ to be less than $s_1^2$, even though, in fact, $E(s_1^2) \le E(s_2^2)$.
Thus minimum residual variance cannot be taken as a single overriding criterion


† This argument is due to H. Theil, Principles of Econometrics, Wiley, New York, 1971, p. 543.
for equation selection. Since there is a monotonic negative relationship between
$\bar{R}^2$ and $s^2$, the same comments apply to a maximum $\bar{R}^2$ criterion, but the fact that
the degree of fit does not discriminate perfectly between the true and competing
models does not mean that evidence on fit is to be ignored.
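The argument can be illustrated numerically. The following is only a sketch under stated assumptions: the regressors, coefficient values, sample size, and disturbance variance are all invented for illustration. It evaluates the expectation in Eq. (12-1) for a wrongly specified model and then uses a small Monte Carlo experiment to count how often the wrong model nevertheless shows the smaller estimated residual variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 40, 1.0

# Hypothetical designs: X1 generates the data, X2 is a correlated rival.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.5 * rng.normal(size=n)
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])
beta1 = np.array([1.0, 0.5])

def annihilator(X):
    """Return M = I - X (X'X)^{-1} X'."""
    return np.eye(len(X)) - X @ np.linalg.solve(X.T @ X, X.T)

M1, M2 = annihilator(X1), annihilator(X2)
k1, k2 = X1.shape[1], X2.shape[1]

# Eq. (12-1): E(e2'e2) = beta1' X1' M2 X1 beta1 + (n - k2) sigma^2
bias_term = beta1 @ X1.T @ M2 @ X1 @ beta1
expected_s2_sq = (bias_term + (n - k2) * sigma2) / (n - k2)

# Monte Carlo: share of samples in which the wrong model has the smaller s^2.
reps, wrong_wins = 2000, 0
for _ in range(reps):
    y = X1 @ beta1 + rng.normal(scale=np.sqrt(sigma2), size=n)
    s1_sq = y @ M1 @ y / (n - k1)
    s2_sq = y @ M2 @ y / (n - k2)
    wrong_wins += int(s2_sq < s1_sq)

print("E(s2^2) =", expected_s2_sq, " share of samples favoring model 2 =", wrong_wins / reps)
```

The expected value exceeds the true variance by construction, yet individual samples can still rank the models the other way, which is the point made in the text.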


Criteria for individual coefficients. There are two important criteria under this 
heading. Economic theory is rich in qualitative predictions about the direction of 
various effects. Thus one looks for agreement between a priori expectation and 
the signs of estimated coefficients. Second, one looks for correctly signed coeffi- 
cients which have reasonable statistical significance. The latter criterion should 
not be applied too stringently since we have seen, for example, that collinearity 
among the regressors can inflate estimated standard errors. The $\bar{R}^2$ criterion also
has implications for the significance level of individual coefficients. As shown in
Problem 5-12, $\bar{R}^2$ only increases with the addition of an extra regressor if the $F$ or,
equivalently, the $t$ statistic for that variable exceeds unity (in absolute value), which corresponds to
the use of a significance level of about 30 percent rather than the conventional 5
or 1 percent level.†
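The link between the fit criterion and the $t$ statistic can be checked on simulated data. This is a minimal sketch, assuming the criterion in question is the degrees-of-freedom-adjusted $\bar{R}^2$; the data-generating values and the helper names `ols` and `adj_r2` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, residual sum of squares, and t statistics."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    rss = e @ e
    s2 = rss / (len(y) - X.shape[1])
    t = b / np.sqrt(s2 * np.diag(XtX_inv))
    return b, rss, t

def adj_r2(rss, y, k):
    """Adjusted R^2 for a regression with k coefficients (including the constant)."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (rss / (n - k)) / (tss / (n - 1))

rng = np.random.default_rng(1)
n = 30
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.1 * x2 + rng.normal(size=n)   # illustrative values

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x2])
_, rss_small, _ = ols(X_small, y)
_, rss_big, t_big = ols(X_big, y)

gain = adj_r2(rss_big, y, 3) - adj_r2(rss_small, y, 2)
print(f"t on extra regressor = {t_big[2]:.3f}, change in adjusted R2 = {gain:.4f}")
```

Whatever the seed, the sign of the change in adjusted $\bar{R}^2$ should agree with whether the absolute $t$ value on the added regressor exceeds one.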

The previous remark is in the context of a fixed sample size. However, any
substantial increase in sample size has implications for significance levels. As seen
in Chap. 5, the test of the hypothesis that a subvector of $q$ elements in $\beta$ is the
zero vector is given by

$$F = \frac{(e_*'e_* - e'e)/q}{e'e/(n - k)} \sim F(q, n - k)$$

where $e'e$ is the residual sum of squares from the unrestricted model and $e_*'e_*$
that from the restricted model, the relevant $q$ variables having been omitted. This
statistic is written equivalently as

$$F = \frac{R^2 - R_*^2}{1 - R^2} \cdot \frac{n - k}{q}$$

Thus even though $R^2 - R_*^2$ may be very small, the test statistic can become
arbitrarily large with increasing sample size. Using a given significance level, the
null hypothesis is more and more likely to be rejected as $n$ increases. This point
has been emphasized by Leamer, who, along with others, argues that the significance
level for this kind of test should be adjusted downward for larger samples.‡
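A short calculation makes the sample-size effect concrete. The sketch below simply evaluates the $R^2$ form of the statistic for a fixed, very small gain in fit; the function name `f_statistic` and the numerical values are illustrative assumptions.

```python
def f_statistic(r2_unrestricted, r2_restricted, n, k, q):
    """F statistic for q exclusion restrictions, written in terms of R^2."""
    return (r2_unrestricted - r2_restricted) / (1.0 - r2_unrestricted) * (n - k) / q

# Hypothetical case: the omitted variable adds only 0.002 to R^2.
for n in (50, 500, 5000, 50000):
    print(n, round(f_statistic(0.602, 0.600, n, k=5, q=1), 2))
```

With the gain in $R^2$ held fixed, the statistic grows roughly in proportion to $n$, so any fixed critical value is eventually exceeded.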


Well-behaved disturbances. As seen in earlier chapters, a homoscedastic nonautocorrelated
disturbance term is a wonderfully powerful assumption from a statistical
point of view. Its presence underlies the derivation of a battery of statistical
tests, while its absence seriously distorts some of these tests and calls for revised
procedures. Thus it is essential to examine the properties of the disturbance term
in order to assess the validity of the statistical tests being applied. However, the


† Recall that $t(r) = \sqrt{F(1, r)}$. See App. A-7.
‡ E. E. Leamer, Specification Searches, Wiley, New York, 1978, pp. 88-89.
same property is often implicitly, and occasionally explicitly, taken as a desirable
feature of a well-specified economic relationship. The purpose of such a relation is 
to model the behavior of some group of economic agents. Can it be a good model 
if the net effect of the omitted variables displays some systematic autocorrelated 
pattern? A misspecification of functional form can also lead to nonrandom 
disturbances. Thus statistical and economic considerations alike lead one to look 
for relations with well-behaved disturbances. However, as noted in Chap. 8, 
discrepancies between decision periods and data periods may well produce 
autocorrelated disturbances in a properly specified economic model. It is also the 
case that efficient estimation of fairly complex dynamic regressions may require 
an autoregressive specification for the disturbance term, but this is a by-product 
of considerations of statistical efficiency: the original relationship is desired to 
have a nonautocorrelated disturbance term. This point is emphasized by Hendry
and Mizon.† Suppose one-period lags on both variables are sufficient to give a
white noise disturbance. The original (general) dynamic relationship between $y$
and $x$ may then be written as

$$y_t = \beta_1 y_{t-1} + \gamma_0 x_t + \gamma_1 x_{t-1} + v_t \tag{12-3}$$

where $|\beta_1| < 1$ and $\{v_t\}$ is white noise. Using the lag operator, this may be
rewritten as

$$(1 - \beta_1 L)\,y_t = (\gamma_0 + \gamma_1 L)\,x_t + v_t$$

If it then were true that the parameters satisfied a restriction

$$\gamma_1 = -\beta_1\gamma_0$$

the relation would become

$$(1 - \beta_1 L)\,y_t = \gamma_0(1 - \beta_1 L)\,x_t + v_t \tag{12-4}$$

which gives

$$y_t = \gamma_0 x_t + u_t \qquad\text{with}\qquad u_t = \beta_1 u_{t-1} + v_t \tag{12-5}$$
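The equivalence of the two forms under the restriction can be verified directly. In this sketch, which uses invented parameter values, data are generated from the static relation with an AR(1) disturbance in Eqs. (12-5), and the dynamic form (12-3) with $\gamma_1 = -\beta_1\gamma_0$ imposed is then evaluated on the same data to recover the white noise innovations.

```python
import numpy as np

rng = np.random.default_rng(2)
T, beta1, gamma0 = 200, 0.6, 1.5   # illustrative values, not from the text
x = rng.normal(size=T)
v = rng.normal(size=T)

# Generate data from the static form (12-5): y_t = gamma0 x_t + u_t, u_t AR(1).
u = np.zeros(T)
u[0] = v[0]
for t in range(1, T):
    u[t] = beta1 * u[t - 1] + v[t]
y = gamma0 * x + u

# Evaluate the dynamic form (12-3) with gamma1 = -beta1 * gamma0 imposed.
gamma1 = -beta1 * gamma0
implied_v = y[1:] - beta1 * y[:-1] - gamma0 * x[1:] - gamma1 * x[:-1]

# The implied disturbances coincide with the white noise innovations.
print(np.max(np.abs(implied_v - v[1:])))
```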


If the restriction were valid, estimation of Eqs. (12-5) would involve just three
parameters, namely $\beta_1$, $\gamma_0$, and $\sigma_v^2$, whereas estimation of Eq. (12-3) involves four
parameters. However, Eq. (12-3) has, in fact, to be estimated to test the restriction.
The payoff is improved statistical efficiency of the parameter estimates if the
restriction is upheld. Comparing Eqs. (12-3) and (12-4) the restriction implies that
$\{y_t\}$ and $\{x_t\}$ have a common factor with root $\beta_1$.‡ There may be no economic
rationale for the restriction or common factor. If so, it is likely to be rejected and
the “general” equation (12-3) cannot then legitimately be reduced to the “simpler”
form in Eq. (12-5). Sargan’s COMFAC program tests for the existence of


† D. F. Hendry and G. E. Mizon, “Serial Correlation as a Convenient Simplification, Not a
Nuisance: A Comment on a Study of the Demand for Money by the Bank of England,” Economic
Journal, vol. 88, 1978, pp. 549-563.

‡ Strictly speaking the root of the polynomial $1 - \beta_1 L = 0$ is $1/\beta_1$.
common factors in dynamic regressions.† Suppose a general dynamic relation

$$\beta(L)\,y_t = \gamma(L)\,x_t + v_t \tag{12-6}$$

can be found where $\{v_t\}$ is white noise. If $\beta(L)$ and $\gamma(L)$ have, say, two common
roots, there exists a quadratic in $L$, say $\delta(L)$, which is common to both $\beta(L)$ and
$\gamma(L)$. Thus we can write

$$\beta(L) = \delta(L)\,\beta^*(L) \qquad\text{and}\qquad \gamma(L) = \delta(L)\,\gamma^*(L)$$

so that Eq. (12-6) becomes

$$\beta^*(L)\,y_t = \gamma^*(L)\,x_t + u_t \qquad \delta(L)\,u_t = v_t \tag{12-7}$$

which involves considerably fewer parameters than Eq. (12-6). Suppose, for
example, that a relationship

$$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \gamma_0 x_t + \gamma_1 x_{t-1} + \gamma_2 x_{t-2} + \gamma_3 x_{t-3} + v_t \tag{12-8}$$

was estimated and a common polynomial $\delta(L) = (1 - L)(1 - \rho L)$ found. The
relation (12-8) may then be written

$$(1 - L)(1 - \rho L)\,y_t = (1 - L)(1 - \rho L)(\gamma_0^* + \gamma_1^* L)\,x_t + v_t$$

which only involves three parameters instead of six. For estimation purposes it
may be put in the form

$$\Delta y_t = \gamma_0^*\,\Delta x_t + \gamma_1^*\,\Delta x_{t-1} + u_t \qquad\text{with}\qquad (1 - \rho L)\,u_t = v_t \tag{12-9}$$

that is, a simple relationship between first differences with an AR(1) process in
the disturbance. The Sargan-Hendry message is that researchers should not begin
with simplified specifications such as Eqs. (12-5) or Eqs. (12-9), but should instead
commence with a general model containing sufficient lags to yield a white noise
disturbance and then test to see how far it can be legitimately simplified.
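The idea of a common factor can be made concrete by comparing the roots of the two lag polynomials. The sketch below is not the COMFAC algorithm, which carries out Wald tests on estimated coefficients; it only builds the polynomials of the example from assumed values of $\rho$, $\gamma_0^*$, and $\gamma_1^*$ and confirms that they share two roots. Coefficients are stored in ascending powers of $L$.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rho = 0.5                                       # illustrative AR(1) parameter
delta = P.polymul([1.0, -1.0], [1.0, -rho])     # (1 - L)(1 - rho L)
gamma_star = np.array([2.0, 0.8])               # gamma0* + gamma1* L (hypothetical)

beta_poly = delta                               # beta*(L) = 1 in this example
gamma_poly = P.polymul(delta, gamma_star)       # cubic in L, as in Eq. (12-8)

roots_beta = P.polyroots(beta_poly)
roots_gamma = P.polyroots(gamma_poly)

# Roots of beta(L) that also appear among the roots of gamma(L).
common = [r for r in roots_beta if np.any(np.isclose(r, roots_gamma))]
print("common roots:", common)
```

In applied work the coefficients are estimates, so exact equality of roots is never observed and a formal test is required.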


Stability of the relationship. A very important indicator of the quality of a
functional specification is the stability of the parameters over various data sets.
This may be examined in two alternative fashions. One is a straightforward test
for structural change as outlined in some detail in Chap. 6. This presupposes
sufficient observations in each subset of data to permit estimation of all parameters.
When that is not the case, the Chow forecasting test may be applied. This
has already been set out in Example 6-5 of Sec. 6-2 and was derived in Sec. 10-1
by using recursive residuals. However, it is often derived in an alternative fashion
as follows.

Suppose the usual linear model has been fitted to $n$ observations of $k$
variables. The OLS coefficient vector is

$$b = \beta + (X'X)^{-1}X'u$$


† J. D. Sargan and J. D. Sylwestrowicz, “COMFAC: Algorithm for Wald Tests of Common
Factors in Lag Polynomials,” User's Manual, London School of Economics, London, 1976.
and it is assumed, as usual, that $u \sim N(0, \sigma^2 I_n)$. Now suppose a new set of $m$
($< k$) observations on these same variables becomes available. On the assumption
that the original model still holds, the new observations may be characterized by

$$y_0 = X_0\beta + u_0$$

where $E(u_0u_0') = \sigma^2 I_m$. The $m$ observations are insufficient to allow reestimation
of the model, but one may forecast the $y_0$ vector by

$$\hat{y}_0 = X_0 b$$

The vector of forecast errors is

$$e_0 = y_0 - \hat{y}_0 = X_0\beta + u_0 - X_0[\beta + (X'X)^{-1}X'u] = u_0 - X_0(X'X)^{-1}X'u$$

It then follows directly that

$$e_0 \sim N(0, \sigma^2 V) \qquad\text{where}\qquad V = I_m + X_0(X'X)^{-1}X_0'$$

Thus

$$\frac{e_0'V^{-1}e_0}{\sigma^2} \sim \chi^2(m)$$

Since $e'e/\sigma^2$ has an independent $\chi^2(n - k)$ distribution, it follows that under the
hypothesis of parameter constancy

$$F = \frac{e_0'[I_m + X_0(X'X)^{-1}X_0']^{-1}e_0/m}{e'e/(n - k)} \sim F(m, n - k) \tag{12-10}$$
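The statistic in Eq. (12-10) is straightforward to compute. The following sketch assumes simulated data generated under parameter constancy; the function name `chow_forecast_f` and all numerical values are hypothetical.

```python
import numpy as np

def chow_forecast_f(X, y, X0, y0):
    """F statistic of Eq. (12-10) for m post-sample observations."""
    n, k = X.shape
    m = X0.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                 # OLS on the first n observations
    e = y - X @ b                         # in-sample residuals
    e0 = y0 - X0 @ b                      # forecast errors
    V = np.eye(m) + X0 @ XtX_inv @ X0.T   # forecast-error covariance up to sigma^2
    num = e0 @ np.linalg.solve(V, e0) / m
    den = e @ e / (n - k)
    return num / den

# Simulated data under parameter constancy (all numbers are illustrative).
rng = np.random.default_rng(3)
n, m = 30, 2
beta = np.array([1.0, 0.5, -0.3])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X0 = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = X @ beta + rng.normal(size=n)
y0 = X0 @ beta + rng.normal(size=m)

F = chow_forecast_f(X, y, X0, y0)
print(f"F({m}, {n - X.shape[1]}) = {F:.3f}")
```

The value would then be compared with a critical value from the $F(m, n - k)$ table. Note that $m = 2$ is smaller than $k = 3$ here, so the new observations could not be used to reestimate the model on their own.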


The hypothesis of a stable relationship would be rejected if the $F$ statistic in Eq.
(12-10) exceeded some preselected critical value. Chow demonstrates the equality
of Eq. (12-10) with the alternative expression in Eq. (6-27).† In an interesting and
important study Jorgenson, Hunter, and Nadiri have used measures of fit,
Durbin-Watson statistics, and tests of structural change to assess and compare
different investment equations.‡ When the regressors are stochastic, the test
statistic in Eq. (12-10) will only be approximately distributed as $F$. Hendry
suggests using an asymptotically equivalent test which neglects the variation due
to estimating the $\beta$ vector. Under the hypothesis of parameter constancy§

$$\frac{e_0'e_0}{s^2} \overset{a}{\sim} \chi^2(m) \tag{12-11}$$


† G. C. Chow, “Tests of Equality between Sets of Coefficients in Two Linear Regressions,”
Econometrica, vol. 28, 1960, pp. 591-605.
‡ D. W. Jorgenson, J. Hunter, and M. I. Nadiri, “A Comparison of Alternative Econometric
Models of Quarterly Investment Behavior,” and “The Predictive Performance of Econometric Models
of Quarterly Investment Behavior,” Econometrica, vol. 38, 1970, pp. 187-224.
§ D. F. Hendry, “Predictive Failure and Econometric Modelling in Macro-Economics: The
Transactions Demand for Money,” London School of Economics, London, September 1975.
where

$$s^2 = \frac{e'e}{n - k}$$

The test of forecast errors in Eq. (12-10) may be extended to deal with joint
forecasts from the reduced form of a simultaneous equation model.†

An aspect of prediction which is frequently ignored, and unjustly so, is the
longer-term implications of the dynamic regression that has been estimated. For
example, return again to our hypothetical demand function for oil, and ask
questions such as:

• What does it imply about the long-run elasticities?

• What does it imply about the length of the long run, that is, how long is it estimated to
take for a full adjustment to a “shock”?

• Is the reaction path plausible or is it the result of a statistical straitjacket
imposed on the data?

The model’s answers to questions such as these have to be put up against the
intuition and good sense of the researchers themselves and, more importantly, the
intuition and good sense of informed critics. This may seem very “unscientific”
and perhaps it is, but it is nonetheless very important and in the next two sections
we present a brief discussion of some ways in which it is attempted.
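For a first-order dynamic relation such as Eq. (12-3), the questions listed above have simple numerical answers. The sketch below assumes a log-linear form, so that coefficients can be read as elasticities, and uses invented coefficient values. It computes the long-run response $(\gamma_0 + \gamma_1)/(1 - \beta_1)$ and the number of periods needed for 90 percent of the adjustment to a sustained change in $x$.

```python
import numpy as np

def dynamic_multipliers(beta1, gamma0, gamma1, horizon):
    """Responses of y at lags 0..horizon-1 to a unit impulse in x, for Eq. (12-3)."""
    m = np.empty(horizon)
    m[0] = gamma0
    if horizon > 1:
        m[1] = gamma1 + beta1 * gamma0
    for j in range(2, horizon):
        m[j] = beta1 * m[j - 1]
    return m

beta1, gamma0, gamma1 = 0.8, -0.10, -0.05   # hypothetical price coefficients
long_run = (gamma0 + gamma1) / (1.0 - beta1)

# Cumulative response to a sustained unit change in x, as a share of the long run.
share = np.cumsum(dynamic_multipliers(beta1, gamma0, gamma1, 60)) / long_run
periods_to_90 = int(np.argmax(share >= 0.90)) + 1

print("long-run elasticity:", long_run)
print("periods to reach 90% of full adjustment:", periods_to_90)
```

Whether the implied adjustment period is plausible is exactly the kind of judgment the text leaves to the researcher and to informed critics.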


The Cairncross Test 


We have suggested that there are various aspects of any specification which are 
important, namely, 


1. Residual variance (or fit)

2. Signs and precision of specific coefficients

3. Properties of the disturbance

4. Parameter stability (predictive performance)


In comparing different specifications there is no serious problem if the 
indicators more or less all point in the same direction, as was the case in the 
Jorgenson, Hunter, and Nadiri study of the investment equation. Where contrary 
indicators emerge, the choice between specifications has to rest on the relative 
importance of various factors to the decision maker. As in the choice of a 
husband or a place to live, a specification is a “package deal.” No one has yet 
found a way to piece together the perfect package, though we continually try to 
improve, as is evidenced by the statistics on divorce, population mobility, and the 
flood of computer printout. In the case of economic specifications the choice can 
be based less on purely subjective personal considerations and more on the 
accumulated knowledge and experience of the critic. 


† See P. H. Dhrymes et al., “Criteria for Evaluation of Econometric Models,” Annals of Economic
and Social Measurement, vol. 1, 1972, pp. 307-308.
This point was brought home to me forcefully and convincingly a few years
ago when I was working on an energy research project. Each month a report on
the econometric activity had to be presented to a steering committee in London,
presided over by Sir Alex Cairncross.† Cairncross had (and still has) a healthy
scepticism of econometrics, no doubt partly due to his days at the U.K. Treasury 
when, to quote, “the young men might present me with thirty different equations 
to ‘explain’ British imports, so that, at the end of the day, neither they nor I knew 
what determined British imports.” Each month the econometric output was 
subjected to his shrewd, informed, and penetrating scrutiny. Eventually, however, 
there came a monthly report which secured the approbation, “I wouldn’t mind 
getting on a plane and taking this to Riyadh.” Presumably I had been engaged in 
some successful data mining or, perhaps, had been “learning by doing,” so I 
suggested to him jokingly that the Cairncross test would appear in the next 
edition of Econometric Methods. The two-step Cairncross test is thus as follows. 


1. Compute your $R^2$, Durbin-Watson statistic, assorted $t$, $F$, and $\chi^2$ statistics for
the best specification you can manage. 

2. Send the resultant report to Sir Alex Cairncross with the question “Would 
you be prepared to take this to Riyadh?” 


The suggestion is, of course, not entirely frivolous. Researchers circulating 
their discussion papers are carrying out informal Cairncross tests. For Cairncross 
substitute the expert of your choice and for Riyadh substitute Washington, the 
editorial offices of the American Economic Review, or some other preferred 
location. 

Economists, however, are not alone in facing difficult choice problems in 
which all the elements cannot be fully quantified and brought together in a single 
equation. Circumstances comparable to those of the Cairncross test arise in a 
broad spectrum of commercial, industrial, and governmental decisions, where the 
best possible research still does not eliminate the need for some personal element 
based on judgment and experience.


The Bayesian Approach 


A Bayesian would criticize the Cairncross test on the grounds that the opinions, 
judgment, and, possibly, prejudices of the expert have only been introduced in 
some implicit, informal, and nonreproducible fashion. Bayesians also tend to 
make a more general criticism of classical inference in that its procedures are 


† Sir Alex Cairncross, a very distinguished British economist, was for many years economic advisor
to Her Majesty's Government and subsequently Master of an Oxford College. I hesitate to give his
present address lest he be deluged with manuscripts from aspiring econometricians, but I suspect he is
mostly to be found in his Scottish retreat north of the Solway Firth, enjoying the Scotsman’s favorite
view “looking down upon England.”

‡ The Bayesian approach requires a book of its own. The premier references are A. Zellner, An
Introduction to Bayesian Inference in Econometrics, Wiley, New York, 1971; and E. E. Leamer,
Specification Searches, Wiley, New York, 1978.
justified in terms of sampling distributions, which picture the behavior of estimators
in repeated sets of sample data. Such repeated samples are never drawn. We
typically have one sample and have to do the best we can with that. Moreover the 
possible losses from incorrect conclusions are not usually considered in the choice 
of inference procedures. 

In principle the Bayesian approach can cope with these problems in one 
integrated framework. The underlying principle is both simple and beautiful, but 
there are problems in the way of practical applications, especially in the area of 
large and complex models. The approach may be illustrated in two steps. 
Suppose, first of all, that there is no uncertainty about the form of the relevant 
model but only about its parameters. To be more specific, let us assume that we 
have n observations drawn at random from 


$$p(y) = (2\pi\sigma_0^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma_0^2}(y - \mu)^2\right\}$$

where $\sigma_0^2$ is known, but the mean $\mu$ is unknown. This gives the vector
$y' = [y_1 \;\; y_2 \;\; \cdots \;\; y_n]$.
The probability density function (pdf) for $y$ is then

$$p(y \mid \mu) = (2\pi\sigma_0^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma_0^2}\sum_{i=1}^{n}(y_i - \mu)^2\right\} \tag{12-12}$$

This is the likelihood for the sample observations, conditional on the parameters
$\mu$ and $\sigma_0^2$, but the latter has been omitted from the left-hand side since it is assumed
known.

The first crucial element in the Bayesian approach is to postulate the
existence of prior information about $\mu$. This may come from theoretical sources,
from previous empirical studies, hunch, judgment, or what have you. Such
information cannot be exact, so it is formulated in a stochastic fashion. It is
theoretically convenient to model this information in a way that is compatible
with the likelihood in Eq. (12-12). This leads to the concept of the conjugate prior.
The prior pdf for $\mu$ is thus taken to be normal and written

$$\pi(\mu) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(\mu - m)^2\right\} \tag{12-13}$$

where $m$ and $\sigma^2$ are specified numerically.†


From elementary probability theory for any two events A and B we can write 


Pr( A, B) = Pr(A) - Pr(B|A) = Pr(B) - Pr( A|B) 


† We are using $p(\cdot)$ to indicate a pdf for sample data and $\pi(\cdot)$ to indicate a pdf for parameters.
This practice was suggested in K. M. Gaver and M. S. Geisel, “Discriminating Among Alternative
Models: Bayesian and non-Bayesian Methods,” in P. Zarembka, Ed., Frontiers in Econometrics,
Academic Press, New York, 1974, Chap. 2.
from which

$$\Pr(B \mid A) = \frac{\Pr(B) \cdot \Pr(A \mid B)}{\Pr(A)} \tag{12-14}$$

Letting $A$ represent the sample vector $y$, $B$ the unknown parameter $\mu$, and
replacing probabilities by pdf's, we have

$$\pi(\mu \mid y) = \frac{\pi(\mu) \cdot p(y \mid \mu)}{p(y)} \tag{12-15}$$

In Eq. (12-15) the expression $\pi(\mu \mid y)$ represents the posterior pdf for $\mu$, and
comparison with $\pi(\mu)$ indicates the change in the researcher's beliefs about $\mu$
brought about by the sample information in $y$. The denominator in Eq. (12-15) is
given by

$$p(y) = \int p(y \mid \mu)\,\pi(\mu)\,d\mu$$

For given $y$, $m$, $\sigma^2$, and $\sigma_0^2$ this reduces to a constant. Thus Eq. (12-15) can be
rewritten as

$$\pi(\mu \mid y) \propto \pi(\mu) \cdot p(y \mid \mu)$$

Substituting from Eqs. (12-12) and (12-13),

$$\pi(\mu \mid y) \propto \exp\left\{-\frac{1}{2}\left[\frac{(\mu - m)^2}{\sigma^2} + \frac{\sum_i (y_i - \mu)^2}{\sigma_0^2}\right]\right\}$$

As shown by Zellner, this pdf can be simplified to†

$$\pi(\mu \mid y) \propto \exp\left\{-\frac{\sigma^2 + \sigma_0^2/n}{2\sigma^2\sigma_0^2/n}\left(\mu - \frac{\hat{\mu}\sigma^2 + m\sigma_0^2/n}{\sigma^2 + \sigma_0^2/n}\right)^2\right\}$$

where $\hat{\mu} = \sum y_i/n$. Thus the posterior pdf for $\mu$ is also normal with mean

$$E(\mu) = \frac{\hat{\mu}(\sigma_0^2/n)^{-1} + m(\sigma^2)^{-1}}{(\sigma_0^2/n)^{-1} + (\sigma^2)^{-1}} \tag{12-16}$$

and

$$\operatorname{var}(\mu) = \frac{1}{(\sigma_0^2/n)^{-1} + (\sigma^2)^{-1}}$$

Formula (12-16) shows that the posterior mean is a weighted average of the
sample mean and the prior mean, the weights being the reciprocals of the
respective variances. Strong prior information (low $\sigma^2$) gives the prior mean a
large role to play in determining the posterior mean, and conversely, strong
sample information (large $n$ and/or low $\sigma_0^2$) gives the sample mean a dominating
role. The importance of the posterior mean rests on a basic result in Bayesian


† A. Zellner, An Introduction to Bayesian Inference in Econometrics, Wiley, New York, 1971, p. 150.
statistics that if one assumes a quadratic loss function for errors in estimating $\mu$,
the estimate which minimizes the expected loss is the posterior mean.†
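The precision-weighted average in Eq. (12-16) is easy to compute. The sketch below uses a made-up sample and prior mean, and the function name `normal_posterior` is an assumption for illustration. It shows how the posterior mean moves from the prior mean toward the sample mean as the prior variance is increased.

```python
import numpy as np

def normal_posterior(y, sigma0_sq, m, sigma_sq):
    """Posterior mean and variance of mu: known sampling variance, normal prior."""
    n = len(y)
    w_sample = 1.0 / (sigma0_sq / n)   # precision of the sample mean
    w_prior = 1.0 / sigma_sq           # precision of the prior
    post_var = 1.0 / (w_sample + w_prior)
    post_mean = post_var * (w_sample * np.mean(y) + w_prior * m)
    return post_mean, post_var

y = np.array([4.8, 5.6, 5.1, 4.9, 5.4])      # made-up sample
for prior_var in (0.01, 1.0, 100.0):
    mean, var = normal_posterior(y, sigma0_sq=1.0, m=3.0, sigma_sq=prior_var)
    print(prior_var, round(mean, 3), round(var, 4))
```

When the two precisions are equal, the posterior mean is the simple average of the sample mean and the prior mean.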

A parallel result holds in the linear regression case when the form of the
model is assumed known, but a multivariate Bayesian prior distribution is
specified for the $\beta$ vector.‡ Assuming a multivariate normal prior, this involves
specifying the mean vector $b^*$ and all the elements in the variance-covariance
matrix, indicated, say, by $\sigma^2 N^{*-1}$. Assuming the usual linear model

$$y = X\beta + u \qquad\text{with}\qquad u \sim N(0, \sigma^2 I)$$

the mean of the posterior distribution is§

$$b^{**} = (N^* + X'X)^{-1}(N^* b^* + X'Xb) \tag{12-17}$$

where $b = (X'X)^{-1}X'y$ is the OLS estimate of $\beta$ from the sample data. This is the
linear regression equivalent of Eq. (12-16). The posterior mean vector is seen to be
a matrix weighted combination of the prior vector $b^*$ and the OLS vector $b$, with
weights proportional to the inverses of the respective variance matrices. Two
immediate problems arise with any attempt to implement Eq. (12-17) in practice.
The first relates to the problem of specifying numerically the elements of the
variance matrix $N^{*-1}$. One may have some intuition about the mean vector $b^*$, but
it is difficult to see the source of numerical information about variances and
covariances. The second problem, as Leamer emphasizes, is that if $b^*$ and $N^*$ are
specified, the posterior is then a function of these specific values. Other investigators
might specify different parameters for the prior distribution with different
implications for the posterior distribution. What is important is to make clear the
mapping from priors to posteriors and to investigate, if possible, the implications
of various classes of prior distribution for posterior distributions.
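The mapping from prior to posterior in Eq. (12-17) can be traced numerically. This is a sketch only: the data are simulated, the prior mean is invented, and the prior matrix is taken to be a scalar multiple of the identity purely for convenience. Varying that scalar shows the posterior mean moving between the OLS vector and the prior mean.

```python
import numpy as np

def posterior_mean(X, y, b_prior, N_prior):
    """Eq. (12-17): matrix-weighted average of the prior mean and OLS."""
    XtX = X.T @ X
    b_ols = np.linalg.solve(XtX, X.T @ y)
    return np.linalg.solve(N_prior + XtX, N_prior @ b_prior + XtX @ b_ols)

rng = np.random.default_rng(4)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
b_prior = np.array([0.0, 1.0])                   # hypothetical prior mean

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
for scale in (1e-6, 1.0, 1e6):                   # prior matrix = scale * I
    print(scale, posterior_mean(X, y, b_prior, scale * np.eye(2)))
```

Repeating the calculation over a grid of prior matrices is one simple way of making the prior-to-posterior mapping visible to other investigators.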

The second step in the Bayesian approach relaxes the assumption that the
form of the model is known and that the only uncertainty relates to the parameter
values. The extension allows uncertainty about both models and parameters. To
simplify the exposition let us suppose that there are just two competing models,
both in the linear regression format. They are specified as

$$M_1: \quad y = X_1\beta_1 + u_1$$
$$M_2: \quad y = X_2\beta_2 + u_2$$

where $X_i$ is nonstochastic of order $n \times k_i$, of full column rank, $u_i \sim N(0, \sigma_i^2 I_n)$,
and there are $n$ sample observations. There are two possible situations. One is
where the models (or hypotheses) are nested, which is the case when $X_2$, say,
includes all the variables in $X_1$ plus some others. The nonnested case occurs when
some (or all) variables in $X_1$ do not appear in $X_2$ and vice versa. Classical
inference procedures apply in a straightforward fashion to the nested case, but


† A. Zellner, op. cit., p. 24.

‡ In practice, there is no justification for assuming the disturbance variance $\sigma^2$ to be known. Thus
the prior distribution should incorporate $\beta$ and $\sigma^2$.

§ E. E. Leamer, Specification Searches, Wiley, New York, 1978, p. 78. 
have no simple treatment for the nonnested case, where Bayesian procedures
admit, in principle, of a very simple solution.

For any model $M_i$ the marginal density of the observations, sometimes
referred to as the predictive pdf, is given by†

$$p(y \mid M_i) = \int p(y \mid \beta_i, M_i)\,\pi(\beta_i \mid M_i)\,d\beta_i \tag{12-18}$$

In Eq. (12-18) $p(y \mid \beta_i, M_i)$ is the likelihood for the sample observations, conditional
on the model $M_i$ and its parameters $\beta_i$, and $\pi(\beta_i \mid M_i)$ is the prior density for
the parameters, given that the model is $M_i$. Equation (12-18) says that if $M_i$ is the
true model, the marginal pdf for the sample observations is found by taking a
weighted average of the sample likelihoods, where the weights are the elements of
the prior distribution for the parameters, given the model. Now suppose that
associated with each model there is a nonnegative fraction $P(M_i)$ indicating the
prior subjective probability that $M_i$ is the true model and, in this case, such that

$$P(M_1) + P(M_2) = 1$$

The unconditional pdf for the sample observations is then

$$p(y) = P(M_1)\,p(y \mid M_1) + P(M_2)\,p(y \mid M_2)$$

and the application of Bayes’s rule to revise the prior probabilities of the models
gives

$$P(M_i \mid y) = \frac{P(M_i)\,p(y \mid M_i)}{p(y)} \tag{12-19}$$

In comparing two models there are just two possible losses, one if $M_1$ is chosen
when $M_2$ is the true model and the other if $M_2$ is chosen when $M_1$ applies. If these
losses were equal, the decision rule that minimizes the posterior expected loss is as
follows. Choose $M_1$ if

$$\frac{P(M_1 \mid y)}{P(M_2 \mid y)} = \frac{P(M_1)\,p(y \mid M_1)}{P(M_2)\,p(y \mid M_2)} \tag{12-20}$$

is greater than 1. Equation (12-20) defines the posterior odds ratio, which is seen to
be equal to the prior odds ratio multiplied by the ratio of the marginal densities
(weighted likelihood functions). If there are more than two models, posterior odds
as defined in Eq. (12-20) can be computed for any pair.

It is clear from Eq. (12-18) that the formidable task in the computation of Eq.
(12-20) is the evaluation of the marginal pdf's. If a multivariate normal prior is
assumed for $\beta_i$, given $M_i$, that is,

$$\pi(\beta_i \mid M_i) \text{ is } N(b_i^*, N_i^{*-1})$$

then Leamer has shown that $p(y \mid M_i)$ varies inversely with a quadratic form $Q_i$,


† We are assuming unrealistically, but for simplicity, that there is no uncertainty about the
disturbance variances.
which may be expressed in alternative ways as†

$$Q_i = (y - X_ib_i)'(y - X_ib_i) + (b_i - b_i^*)'(N_i^{*-1} + N_i^{-1})^{-1}(b_i - b_i^*) \tag{12-21}$$

or

$$Q_i = (y - X_ib_i^*)'(y - X_ib_i^*) - (b_i - b_i^*)'N_i(N_i^* + N_i)^{-1}N_i(b_i - b_i^*) \tag{12-22}$$

where $N_i = X_i'X_i$ and $b_i = (X_i'X_i)^{-1}X_i'y$. The first term in Eq. (12-21) is the
residual sum of squares from the OLS fit of model $M_i$. This has to be increased by
a term depending on the discrepancy between the OLS vector and the prior
mean vector. Alternatively, the first term in Eq. (12-22) is the error sum of squares
if the coefficient vector were set equal to the prior mean vector. This is adjusted
downward by a term which is again dependent on the discrepancy between the
sample and the prior coefficient vectors. Thus apart from the prior odds ratio, the
choice between models would depend on these adjusted sums of squares, which
are a mixture of sample and prior information.
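The two expressions for $Q_i$ and their role in the posterior odds can be checked numerically. The sketch below rests on assumptions that go beyond the text: the disturbance variance is treated as known and common to both models, the prior for each coefficient vector is taken to be $N(b_i^*, \sigma^2 N_i^{*-1})$, the prior odds are set to one, and the data are simulated. Under those assumptions the marginal density of $y$ is normal with mean $X_ib_i^*$ and covariance $\sigma^2(I + X_iN_i^{*-1}X_i')$, whose quadratic form equals $Q_i$.

```python
import numpy as np

def q_form_21(X, y, b_star, N_star):
    """Eq. (12-21): OLS residual sum of squares plus a prior-discrepancy term."""
    N = X.T @ X
    b = np.linalg.solve(N, X.T @ y)
    d = b - b_star
    W = np.linalg.inv(np.linalg.inv(N_star) + np.linalg.inv(N))
    return (y - X @ b) @ (y - X @ b) + d @ W @ d

def q_form_22(X, y, b_star, N_star):
    """Eq. (12-22): sum of squares at the prior mean minus an adjustment."""
    N = X.T @ X
    b = np.linalg.solve(N, X.T @ y)
    d = b - b_star
    r = y - X @ b_star
    return r @ r - d @ N @ np.linalg.solve(N_star + N, N @ d)

def log_marginal(X, y, b_star, N_star, sigma_sq):
    """log p(y|M), assuming beta ~ N(b_star, sigma_sq * inv(N_star)), sigma_sq known."""
    n = len(y)
    cov_scale = np.eye(n) + X @ np.linalg.solve(N_star, X.T)
    _, logdet = np.linalg.slogdet(cov_scale)
    q = q_form_21(X, y, b_star, N_star)
    return -0.5 * (n * np.log(2 * np.pi * sigma_sq) + logdet + q / sigma_sq)

rng = np.random.default_rng(5)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + rng.normal(size=n)            # data generated from "model 1"
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])
prior_mean, prior_prec = np.zeros(2), np.eye(2)    # same hypothetical prior for both

q1_a = q_form_21(X1, y, prior_mean, prior_prec)
q1_b = q_form_22(X1, y, prior_mean, prior_prec)
log_odds = (log_marginal(X1, y, prior_mean, prior_prec, 1.0)
            - log_marginal(X2, y, prior_mean, prior_prec, 1.0))   # equal prior odds
print(q1_a, q1_b, np.exp(log_odds))
```

The two forms of $Q_i$ agree to rounding error, and with unequal prior odds the last figure would simply be multiplied by $P(M_1)/P(M_2)$.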

Readers must judge for themselves whether a criterion such as Eq. (12-20), 
for all its elegance and simplicity, is a valid guide for choice. Suppose that just 
two crude and simple models are being compared. $M_1$, say, is a “Keynesian”
reduced-form equation relating GNP to “exogenous expenditures,” while $M_2$ is a
“Friedmanian” equation relating GNP to “money.” If the Ghost of Keynes could
be contacted, he would presumably offer a prior-odds ratio $P(M_1)/P(M_2)$,
dramatically different from that forthcoming from Professor Friedman. How can
the protagonist of one theory begin to specify the prior pdf's for the parameters 
of the opposing theory, which he basically regards as false? Must then a Bayesian 
researcher be certified ideologically pure and unbiased before being allowed to 
specify prior odds and prior densities for model parameters? 

A partial resolution to the problem of excessive dependence on priors, which 
are, perhaps, spuriously precise, idiosyncratic, or just personal to one investigator, 
is provided by some recent work by Chamberlain and Leamer.‡ It is assumed that
in a single equation there are one or more “focus” variables, whose coefficients 
are of crucial interest. The equation may also contain other “doubtful” variables. 
The investigator specifies a prior zero mean vector for the doubtful variables. 
However, he does not have to specify the elements of the prior variance matrix, 
merely that it belongs to the class of positive definite or semidefinite matrices. 
Leamer’s SEARCH program computes bounds on the focus coefficients, so that 
the researcher can study the robustness of these coefficients under a variety of 
specifications. This approach is appealing and seems likely to be developed and 
considerably extended. It adds yet another dimension to the array of information 
that we can obtain on any specific problem. How to weigh and interpret the 


† E. E. Leamer, op. cit., p. 109.
‡ E. E. Leamer, op. cit., pp. 182-201. See also T. F. Cooley and S. F. LeRoy, “Identification and
Estimation of Money Demand,” American Economic Review, vol. 71, 1981, pp. 825-844, for a very
interesting detailed application of the procedure.
jigsaw of computation and information will still depend on the vital spark of
human imagination and powers of judgment. 

The position can best be summarized by a quotation from the late Jacob
Bronowski.† Though writing of the physical world, his comments are very
apposite to the economic and social world that we study.


The world is not a fixed, solid array of objects, out there, for it cannot be
fully separated from our perception of it. It shifts under our gaze, it interacts
with us, and the knowledge that it yields has to be interpreted by us. There is
no way of exchanging information that does not demand an act of judgment.

Science is a very human form of knowledge. We are always at the brink of the
known, we always feel forward for what is to be hoped. Every judgment in
science stands on the edge of error, and is personal.


† J. Bronowski, The Ascent of Man, Little, Brown, Boston, 1973, pp. 364 and 374.


APPENDIX A

MATHEMATICAL AND STATISTICAL APPENDICES


A-1 FUNCTIONS AND DERIVATIVES


The purpose of this section is merely to remind the reader of various notational 
conventions for functions and derivatives. It is not intended to review the basic 
rules of differentiation.† If y is a function of x, the relationship may be denoted
variously as


$$y = y(x) \qquad y = f(x) \qquad y = g(x) \qquad y = F(x)$$


and so on. Once a specific functional form for the relationship has been assumed, 
one can determine the shape of the function by studying the behavior of y in 
response to variation in x. Starting from an initial value, say, $x_0$, and moving to
$x_1 = x_0 + \Delta x$ will trace a movement in the dependent variable from $y_0$ to $y_1$. The
ratio

$$\frac{\Delta y}{\Delta x} = \frac{y_1 - y_0}{x_1 - x_0}$$


measures the change in y per unit change in x. Taking the limit of this ratio as
$\Delta x \to 0$ gives the derivative of y with respect to x, written variously as


$$\frac{dy}{dx} \qquad f'(x)$$


The derivative measures the slope of the function at a specific point and is, in 
general, a function of x, as is emphasized by the f’(x) notation. Thus it may itself
be differentiated with respect to x, giving the second-order derivative, denoted by

$$\frac{d^2 y}{dx^2}$$

† For a lucid introduction to the calculus and other mathematical topics of special relevance to
economists see A. C. Chiang, Fundamental Methods of Mathematical Economics, 2d edition,
McGraw-Hill, 1974.

When y is a function of several variables, say,

$$y = f(x_1, x_2, \ldots, x_n)$$


where the x’s are capable of moving independently of one another, then one may 
study the change in y in response to the change in any one of the independent
variables (or arguments) of the function, the other independent variables being 
held constant at any arbitrary set of values. This gives rise to the partial 
derivatives, denoted by 


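$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_i = f_i(x_1, x_2, \ldots, x_n) \qquad i = 1, 2, \ldots, n$$


In this notation $f_2$, for example, would indicate the rate of change of y with
respect to $x_2$. Once again further partial differentiation may be carried out,
yielding the second-order partial derivatives

$$\frac{\partial^2 y}{\partial x_i\, \partial x_j} = f_{ij} \qquad i, j = 1, 2, \ldots, n$$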


Alternatively, if the independent variables are given separate labels, as in 


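$$y = f(u, v)$$

the partial derivatives may be denoted by

$$\frac{\partial y}{\partial u} = f_u \qquad \frac{\partial y}{\partial v} = f_v \qquad \frac{\partial^2 y}{\partial u\, \partial v} = f_{uv}$$

though, even here, one may see $f_1$ used for $f_u$, and $f_2$ for $f_v$.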


A-2 EXPONENTIAL AND LOGARITHMIC FUNCTIONS 


Consider the function

$$y = b^x \qquad b > 0 \tag{A-1}$$


This is called an exponential function since the variable x appears as the exponent
of the constant, or base, b. We rule out negative values for b, since if x were, say,
one-half, y would be the square root of a negative number, which is imaginary. If
x denoted time t measured at equal intervals, then

$$y_t = b^t \qquad \text{and} \qquad \frac{y_{t+1}}{y_t} = b$$

Thus $y_t$ denotes a series which is growing (b > 1) or declining (0 < b < 1) at a
constant rate. If we set b = 1 + r, then r denotes the proportionate rate of change
in y per unit period of time.

Figure A-1 (a) Exponential function; (b) logarithmic function.


The logarithm of a number to a given base is defined as the power to which
the base must be raised to give the number. Thus in Eq. (A-1) x is the logarithm
of y to base b, written

$$x = \log_b y \tag{A-2}$$

This is the inverse of the exponential function. The first expresses y as a function
of x and the second expresses x as a function of y. Typical graphs for b > 1 are
shown in Fig. A-1. If the graph in Fig. A-1b were superimposed on Fig. A-1a with
the y axis on the y axis and the x axis on the x axis, the curves would coincide.
Numerical calculations are facilitated by the tables of common logarithms, which
are taken to the base 10. Thus, for example, $\log_{10} 100 = 2$ since $100 = (10)^2$. In
practice the subscript 10 is rarely shown explicitly. For mathematical purposes it
is usually much more convenient to work with natural logarithms, which are taken
to base e. This is the mathematical constant defined by†


$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = 2.71828\ldots$$

This has the remarkable property that if

$$y = e^x$$

then

$$\frac{dy}{dx} = \frac{d^2 y}{dx^2} = \cdots = e^x$$

that is, all derivatives are equal to the original function. The function is written in
alternative forms as

$$y = e^x \qquad \text{or} \qquad y = \exp\{x\}$$

and the inverse logarithmic function is written as‡

$$x = \log_e y \qquad \text{or} \qquad x = \ln y$$

The general exponential function is written as

$$y = Ae^{cx} \qquad \text{or} \qquad y = A\exp\{cx\}$$

which has the effect of stretching or contracting the typical exponential shape in
Fig. A-1a vertically and horizontally.

† See also the footnote on p. 68.


If the inverse function exists, as it does when y = f(x) is monotonic (that is,
to each value of x there corresponds a unique value of y and vice versa), then

$$\frac{dx}{dy} = \frac{1}{dy/dx}$$

If $y = e^x$, then $dy/dx = e^x = y$, and so for the inverse function, x = ln y, dx/dy
= 1/y. Thus we have the two standard forms:

$$y = e^x \qquad \frac{dy}{dx} = e^x$$

$$y = \ln x \qquad \frac{dy}{dx} = \frac{1}{x}$$

Suppose we have y = log x. What then is dy/dx? We may write

$$x = 10^y$$

so that ln x = y ln 10. Thus

$$y = \frac{\ln x}{\ln 10} = \ln x \cdot \log e$$

since it may easily be shown that $\ln 10 \cdot \log e = 1$. It then follows that

$$\frac{d}{dx}(\log x) = \frac{1}{x} \log e \tag{A-3}$$


Finally we may note a frequently used connection between logarithms and
elasticities. If y = f(x) and a change $\Delta x$ is imposed leading to a change $\Delta y$, then

$$\frac{\Delta y}{y} \Big/ \frac{\Delta x}{x} = \frac{\Delta y}{\Delta x} \cdot \frac{x}{y}$$

measures the proportionate change in y per unit proportionate change in x. The
elasticity of y with respect to x is defined as the limiting value of this ratio as
$\Delta x \to 0$, that is,

$$\text{(Point) elasticity of } y \text{ with respect to } x = \frac{dy}{dx} \cdot \frac{x}{y}$$

It may be shown that

$$\frac{dy}{dx} \cdot \frac{x}{y} = \frac{d(\ln y)}{d(\ln x)} = \frac{d(\log y)}{d(\log x)} \tag{A-4}$$

To show the first part of the identity let

$$z = \ln y \qquad y = f(x) \qquad \text{and} \qquad x = e^w \quad \text{so that } w = \ln x$$

Then

$$\frac{d(\ln y)}{d(\ln x)} = \frac{dz}{dw} = \frac{dz}{dy} \cdot \frac{dy}{dx} \cdot \frac{dx}{dw} = \frac{1}{y} \cdot \frac{dy}{dx} \cdot x = \text{elasticity of } y \text{ with respect to } x$$

‡ In general ln denotes a logarithm to base e and log a logarithm to base 10.

The second part of the identity shows that the same relation holds if logarithms 
are taken to base 10, since there is a proportionate relationship between loga- 
rithms to the two bases. 

It follows from Eq. (A-4) that a functional form which implies a linear 
relation between the logs of the variables is a constant elasticity function. For 
instance, 

$$y = Ax^{\alpha}$$

gives

$$\log y = \log A + \alpha(\log x)$$

so that $\alpha$ is the elasticity of y with respect to x. A simple way to fix the meaning of
an elasticity is that it measures the percentage change in y produced by a 1 percent
change in x.

The elasticity concept extends to functions of several variables. Thus

$$y = Ax^{\alpha} v^{\beta} z^{\gamma}$$

is a constant elasticity function, where $\alpha$, $\beta$, and $\gamma$ are the partial elasticities with
respect to the arguments x, v, and z.
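
The identity in Eq. (A-4) can be checked numerically. The Python sketch below is illustrative only: the helper names `point_elasticity` and `log_log_slope`, the step size `h`, and the parameter values A = 3 and alpha = 0.7 are assumptions made for the example. It approximates both sides of Eq. (A-4) by central differences for a constant elasticity function.

```python
import math

A, alpha = 3.0, 0.7          # hypothetical parameter values


def f(x):
    """Constant elasticity function y = A * x**alpha."""
    return A * x ** alpha


def point_elasticity(func, x, h=1e-6):
    """Approximate (dy/dx)(x/y) with a central difference in x."""
    dydx = (func(x + h) - func(x - h)) / (2.0 * h)
    return dydx * x / func(x)


def log_log_slope(func, x, h=1e-6):
    """Approximate d(ln y)/d(ln x) by perturbing w = ln x directly."""
    w = math.log(x)
    upper = math.log(func(math.exp(w + h)))
    lower = math.log(func(math.exp(w - h)))
    return (upper - lower) / (2.0 * h)


for x in (0.5, 2.0, 10.0):
    print(x, point_elasticity(f, x), log_log_slope(f, x))
```

Under these assumptions both approximations should agree with alpha at every x, which is what a constant elasticity function implies.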


A-3 OPERATIONS WITH SUMMATION SIGNS 
The Greek capital sigma is used to indicate summation. Thus

$$\sum_{i=1}^{n} X_i = (X_1 + X_2 + \cdots + X_n)$$

This sum is variously denoted by

$$\sum_{i=1}^{n} X_i \qquad \sum_{i} X_i \qquad \sum X_i \qquad \text{or} \qquad \sum X$$



so long as no ambiguity is involved in any particular application.

If each value of X is multiplied by a constant a, the sum is

$$(aX_1 + aX_2 + \cdots + aX_n) = \sum_{i=1}^{n} aX_i = a \sum_{i=1}^{n} X_i \tag{A-5}$$

Thus a constant appearing after a summation sign may be moved in front of it to
multiply the sum. If each $X_i$ in Eq. (A-5) were equal to unity, $\sum X_i = n$ and

$$\sum_{i=1}^{n} a = na$$

Thus the summation of a constant over n points is n times the constant.
The arithmetic mean of the X’s is

$$\bar{X} = \frac{\sum X_i}{n}$$

It follows directly from the definition that

$$\sum_{i=1}^{n} (X_i - \bar{X}) = \sum X_i - n\bar{X} = 0$$



so that the algebraic sum of deviations around an arithmetic mean is zero. The 
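sum of squared deviations from the arithmetic mean is

$$\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})^2 &= \sum_{i=1}^{n} (X_i^2 - 2\bar{X}X_i + \bar{X}^2) \\
&= \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2 \\
&= \sum X_i^2 - \frac{1}{n}\left(\sum X_i\right)^2
\end{aligned}$$

or alternatively,

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$$

that is, the sum of squared deviations about the sample mean can be expressed as
the sum of the squared values of the original variables less a correction factor for
the mean. Similarly, one may derive

$$\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum X_iY_i - n\bar{X}\bar{Y} \\
&= \sum X_iY_i - \frac{1}{n}\left(\sum X_i\right)\left(\sum Y_i\right)
\end{aligned}$$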
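
The shortcut formulas for sums of squared deviations and cross products can be verified on a small data set. The Python sketch below is only an illustration: the function names and the sample values in `X` and `Y` are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math


def sum_sq_dev(xs):
    """Sum of squared deviations about the mean, computed directly."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def sum_sq_shortcut(xs):
    """Same quantity as sum(X^2) - n * mean^2."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) - n * mean * mean


def sum_cross_dev(xs, ys):
    """Sum of cross products of deviations, computed directly."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


def sum_cross_shortcut(xs, ys):
    """Same quantity as sum(XY) - (1/n) * sum(X) * sum(Y)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


X = [2.0, 4.0, 7.0, 11.0]    # hypothetical sample values
Y = [1.0, 3.0, 2.0, 8.0]

print(sum_sq_dev(X), sum_sq_shortcut(X))
print(sum_cross_dev(X, Y), sum_cross_shortcut(X, Y))
```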
Suppose a variable has two subscripts, say,

$$X_{ij} \qquad i = 1, \ldots, p; \quad j = 1, 2, \ldots, n_i$$

This is illustrated in the following table:

Class    Observations
1        $X_{11}, X_{12}, \ldots, X_{1n_1}$
2        $X_{21}, X_{22}, \ldots, X_{2n_2}$
...      ...
p        $X_{p1}, X_{p2}, \ldots, X_{pn_p}$

As an example, X might measure personal income and the sample data consist of
$n_1$ observations from social group 1, $n_2$ observations from social group 2, and so
forth. Total income in the sample is defined by

$$\sum_{i=1}^{p} \sum_{j=1}^{n_i} X_{ij} \qquad \text{or, more simply,} \qquad \sum_i \sum_j X_{ij}$$


The total number of sample observations is

$$n = \sum_{i=1}^{p} n_i$$

and the overall mean income is then

$$\bar{X} = \frac{\sum_i \sum_j X_{ij}}{n}$$

The mean income for the ith group, or class, is

$$\bar{X}_i = \frac{\sum_{j} X_{ij}}{n_i}$$


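The sum of squared deviations about the overall mean is

$$\begin{aligned}
\sum_i \sum_j (X_{ij} - \bar{X})^2 &= \sum_i \sum_j \left[(X_{ij} - \bar{X}_i) + (\bar{X}_i - \bar{X})\right]^2 \\
&= \sum_i \sum_j (X_{ij} - \bar{X}_i)^2 + \sum_i \sum_j (\bar{X}_i - \bar{X})^2 + 2\sum_i \sum_j (X_{ij} - \bar{X}_i)(\bar{X}_i - \bar{X})
\end{aligned}$$

The last term may be written

$$2\sum_{i=1}^{p} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)(\bar{X}_i - \bar{X}) = 2\sum_{i=1}^{p} (\bar{X}_i - \bar{X}) \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)$$

since the factor $(\bar{X}_i - \bar{X})$ does not involve the j subscript and so may be moved in
front of the summation over j. But

$$\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i) = 0$$

for each i, and so the whole term vanishes. The middle term may be written

$$\sum_{i=1}^{p} \sum_{j=1}^{n_i} (\bar{X}_i - \bar{X})^2 = \sum_{i=1}^{p} n_i (\bar{X}_i - \bar{X})^2$$

since $(\bar{X}_i - \bar{X})^2$ is a constant for each element in the ith group, so that the sum
over j is simply $n_i(\bar{X}_i - \bar{X})^2$. Thus

$$\sum_i \sum_j (X_{ij} - \bar{X})^2 = \sum_i \sum_j (X_{ij} - \bar{X}_i)^2 + \sum_i n_i (\bar{X}_i - \bar{X})^2$$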
This decomposition is often written as 
Total sum of squares = within-group sum of squares 
+ between-group sum of squares 
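
A short numerical check of this decomposition may help. The Python sketch below assumes a hypothetical data set of three groups of unequal size (the values in `groups` are invented for illustration) and computes the total, within-group, and between-group sums of squares separately.

```python
import math

# Hypothetical incomes for p = 3 groups with n_1 = 3, n_2 = 2, n_3 = 4.
groups = [[10.0, 12.0, 14.0], [20.0, 22.0], [30.0, 31.0, 35.0, 36.0]]


def overall_mean(data):
    values = [x for group in data for x in group]
    return sum(values) / len(values)


def total_ss(data):
    """Sum of squared deviations of every observation about the overall mean."""
    grand = overall_mean(data)
    return sum((x - grand) ** 2 for group in data for x in group)


def within_ss(data):
    """Sum of squared deviations of each observation about its own group mean."""
    total = 0.0
    for group in data:
        mean_i = sum(group) / len(group)
        total += sum((x - mean_i) ** 2 for x in group)
    return total


def between_ss(data):
    """Sum over groups of n_i times the squared deviation of the group mean."""
    grand = overall_mean(data)
    return sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in data)


print(total_ss(groups), within_ss(groups) + between_ss(groups))
```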


A-4 RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS†


A discrete random variable X consists of a set of possible values $x_1, x_2, \ldots, x_k$
and associated positive fractions (probabilities) $p_1, p_2, \ldots, p_k$ such that

$$\sum_{i=1}^{k} p_i = 1$$


The two most important features of the probability distribution are the mean and 
the variance. The mean, often denoted by $\mu$, is defined as

$$\mu = E(X) = \sum_{i=1}^{k} x_i p_i \tag{A-6}$$

which is just a weighted average of the x values, the weights being the respective
probabilities. E is the expectation operator, and it may also be applied to various
functions of X. For example, $E(X^2)$ indicates the expected value of $X^2$. The
possible values for $X^2$ are $x_1^2, x_2^2, \ldots, x_k^2$, which occur with probabilities
$p_1, p_2, \ldots, p_k$. Thus

$$E(X^2) = \sum_{i=1}^{k} x_i^2 p_i$$

† For the statistical paragraphs in this appendix, two of the most lucid texts at an introductory
level are P. J. Hoel, Introduction to Mathematical Statistics, 4th edition, Wiley, New York, 1971, and
L. D. Taylor, Probability and Mathematical Statistics, Harper and Row, New York, 1974.

The second most important feature of the probability distribution is the variance
or expected squared deviation about the mean. This is usually denoted by $\sigma^2$. Thus

$$\sigma^2 = E\{(X - \mu)^2\} \tag{A-7}$$

Evaluating this from first principles,

$$\begin{aligned}
E\{(X - \mu)^2\} &= \sum_{i=1}^{k} (x_i - \mu)^2 p_i \\
&= \sum x_i^2 p_i - 2\mu \sum x_i p_i + \mu^2 \sum p_i \\
&= \sum x_i^2 p_i - \left(\sum x_i p_i\right)^2 \\
&= E(X^2) - [E(X)]^2
\end{aligned}$$

This result may also be obtained by squaring the expression in Eq. (A-7) and
applying the expectation operator to each term in turn. Thus

$$\begin{aligned}
E\{(X - \mu)^2\} &= E(X^2 - 2\mu X + \mu^2) \\
&= E(X^2) - 2\mu E(X) + E(\mu^2) \\
&= E(X^2) - [E(X)]^2
\end{aligned}$$

since $E(\mu^2)$ indicates the expectation of a constant, which is simply the constant.
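
The two routes to the variance can be compared exactly for a simple discrete distribution. The Python sketch below takes a fair six-sided die as a hypothetical example (the die, the helper name `expectation`, and the use of exact fractions are all choices made for illustration) and evaluates Eq. (A-7) both from first principles and as the expected square less the squared mean.

```python
from fractions import Fraction


def expectation(values, probs, g=lambda x: x):
    """E{g(X)} for a discrete random variable: sum of g(x_i) * p_i."""
    return sum(g(x) * p for x, p in zip(values, probs))


# Hypothetical example: a fair die, so each value has probability 1/6.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

mu = expectation(values, probs)
var_direct = expectation(values, probs, lambda x: (x - mu) ** 2)
var_shortcut = expectation(values, probs, lambda x: x * x) - mu ** 2

print(mu, var_direct, var_shortcut)
```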
When the random variable is continuous, the discrete probabilities are 
replaced by a continuous probability density function (pdf), usually denoted by 
p(x) or f(x). An example is shown in Fig. A-2. 
The pdf has the properties that 


$$f(x) \ge 0 \qquad \text{for all } x$$

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

and

$$\int_a^b f(x)\,dx = \Pr\{a < x < b\}$$

The mean and the variance are defined as before, but integrals now replace
summation signs.

[Figure A-2: a pdf f(x) plotted against x, with the points a and b marked on the horizontal axis]

We are often interested in the joint variation of a pair of random variables.
Let the variables X, Y have a bivariate pdf denoted by f(x, y). Then 


$$f(x, y) \ge 0 \qquad \text{for all } x, y$$

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$$

and

$$\int_c^d \int_a^b f(x, y)\,dx\,dy = \Pr\{a < X < b,\; c < Y < d\}$$

Given the joint density, a marginal density is obtained for each variable by
integrating over the range of the other variable. Thus

$$\text{Marginal pdf for } X = \int_{-\infty}^{\infty} f(x, y)\,dy = f(x)$$

and

$$\text{Marginal pdf for } Y = \int_{-\infty}^{\infty} f(x, y)\,dx = f(y)$$

A conditional pdf for Y, given X, is defined as

$$f(y|x) = \frac{f(x, y)}{f(x)} \tag{A-8}$$

and similarly, a conditional pdf for X, given Y, is defined as

$$f(x|y) = \frac{f(x, y)}{f(y)}$$

Two variables are said to be statistically independent, or independently
distributed, if the marginal and conditional densities are the same. Thus the joint
density can be written as the product of the marginal densities

$$f(x, y) = f(x) f(y) \tag{A-9}$$


The mean and variance for each variable may be obtained from the marginal
densities. Thus

$$\mu_x = E(X) = \iint x f(x, y)\,dx\,dy = \int x f(x)\,dx$$

$$\sigma_x^2 = \text{var}(X) = \int (x - \mu_x)^2 f(x)\,dx$$

and similarly for the mean and the variance of Y. A new statistic for the bivariate
case is the covariance. It is defined as

$$\sigma_{xy} = \text{cov}(X, Y) = E\{(X - \mu_x)(Y - \mu_y)\} = \iint (x - \mu_x)(y - \mu_y) f(x, y)\,dx\,dy$$

and measures the linear association between the two variables. For independently
distributed variables the covariance is zero since

$$\iint (x - \mu_x)(y - \mu_y) f(x) f(y)\,dx\,dy = \int (x - \mu_x) f(x)\,dx \int (y - \mu_y) f(y)\,dy = 0$$
In general, the converse of this proposition is not true, that is, a zero covariance 
does not necessarily imply independence. An important exception to the proposi- 
tion, however, exists in the case of normally distributed variables, as is shown in 
App. A-5. 
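
A standard counterexample shows why zero covariance does not imply independence. The Python sketch below uses a discrete analogue of the continuous definitions above: X takes the values -1, 0, 1 with equal probability and Y is defined as the square of X. This particular distribution and the helper name `marginal` are assumptions chosen for illustration.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Hypothetical joint distribution: X uniform on {-1, 0, 1} and Y = X**2.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def marginal(dist, axis):
    """Sum the joint probabilities over the other variable."""
    out = {}
    for point, p in dist.items():
        out[point[axis]] = out.get(point[axis], 0) + p
    return out


f_x = marginal(joint, 0)
f_y = marginal(joint, 1)
mean_x = sum(x * p for x, p in f_x.items())
mean_y = sum(y * p for y, p in f_y.items())

covariance = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

# Independence requires f(x, y) = f(x) f(y) at every pair of values.
independent = all(joint.get((x, y), 0) == f_x[x] * f_y[y] for x in f_x for y in f_y)

print(covariance, independent)
```

Here Y is completely determined by X, yet the covariance vanishes because the relation between them is not linear.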


A-5 NORMAL PROBABILITY DISTRIBUTION 


The pdf for the univariate normal distribution is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} \tag{A-10}$$

This defines a two-parameter family of distributions, the parameters being the
mean $\mu$ and the variance $\sigma^2$. The bell-shaped curve reaches its maximum at $x = \mu$
and is symmetrical about that point. A special member of the family is the
standard normal distribution, which has zero mean and unit variance. An area
under any specific normal distribution may be expressed as an equivalent area
under the standard distribution by defining

$$z = \frac{x - \mu}{\sigma}$$

Clearly, E(z) = 0 and var(z) = 1, so that

$$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \tag{A-11}$$

Then

$$\int_{x_1}^{x_2} f(x)\,dx = \int_{z_1}^{z_2} f(z)\,dz$$

where $z_i = (x_i - \mu)/\sigma$. The areas under Eq. (A-11) are tabulated in App. B-1.

Three very important results about the normal distribution are as follows.

1. Linear combinations of normally distributed variables are themselves normally
distributed.† For example, if x denotes an n × 1 vector of variables, which
follow the multivariate normal distribution

$$x \sim N(\mu, \Sigma)$$

and if a vector y is defined by y = Dx where D is an m × n matrix of rank
m < n, then

$$y \sim N(D\mu, D\Sigma D')$$

† See L. D. Taylor, Probability and Mathematical Statistics, Harper and Row, New York, 1974, pp.
154-160.


2. Central limit theorem. If $(x_1, x_2, \ldots)$ is a sequence of independent random
variables with means $(\mu_1, \mu_2, \ldots)$ and variances $(\sigma_1^2, \sigma_2^2, \ldots)$, then

$$\lim_{n \to \infty} \Pr\left[a < \frac{\sum_{i=1}^{n} (x_i - \mu_i)}{\sqrt{\sum_{i=1}^{n} \sigma_i^2}} < b\right] = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz \tag{A-12}$$

Notice first of all that nothing is assumed about the specific forms of the
various pdf's other than the existence of means and variances. The remarkable
result embodied in Eq. (A-12) is that the limiting or asymptotic distribution
of the quantity $\sum (x_i - \mu_i) / \sqrt{\sum \sigma_i^2}$ is the standard normal distribution. A
special case of the result may help to make its meaning clearer. Suppose the
means and variances are all identical. The statistic in Eq. (A-12) then reduces
to

$$\frac{\sum_{i=1}^{n} x_i - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{x} - \mu)}{\sigma}$$

and the theorem states that $\bar{x}$ is asymptotically normally distributed with
mean $\mu$ and variance $\sigma^2/n$.


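3. Zero covariance between two normally distributed variables implies statistical
independence. The bivariate normal distribution is

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right) + \left(\frac{y - \mu_y}{\sigma_y}\right)^2\right]\right\}$$

where $\rho = \sigma_{xy}/\sigma_x\sigma_y$. When the covariance $\sigma_{xy}$ is zero, the joint pdf simplifies
to

$$f(x, y) = \left[\frac{1}{\sigma_x\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{x - \mu_x}{\sigma_x}\right)^2\right\}\right] \left[\frac{1}{\sigma_y\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{y - \mu_y}{\sigma_y}\right)^2\right\}\right]$$

which is the product of two separate normal pdfs. Thus X and Y are
independently distributed.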
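
The special case of the central limit theorem in result 2, with identical means and variances, can be illustrated without simulation. The Python sketch below assumes Bernoulli variables with success probability p = 0.3 (an arbitrary choice, as are the interval limits a = -1 and b = 1 and the function names). It computes the exact probability that the standardized sum lies between a and b from the binomial distribution and compares it with the standard normal area in Eq. (A-12).

```python
import math


def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binomial_pmf(k, n, p):
    """Binomial probability, evaluated in logs to avoid overflow for large n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def prob_standardized_sum(n, p, a, b):
    """Exact Pr[a < (sum x_i - n*mu) / (sigma*sqrt(n)) < b] for Bernoulli(p) x_i."""
    mu, sigma = p, math.sqrt(p * (1.0 - p))
    total = 0.0
    for k in range(n + 1):
        z = (k - n * mu) / (sigma * math.sqrt(n))
        if a < z < b:
            total += binomial_pmf(k, n, p)
    return total


a, b, p = -1.0, 1.0, 0.3
limit = normal_cdf(b) - normal_cdf(a)
errors = {n: abs(prob_standardized_sum(n, p, a, b) - limit) for n in (10, 100, 1000)}

for n, err in errors.items():
    print(n, err)
```

Because the binomial distribution is discrete, the gap need not shrink smoothly at every step, but under these assumptions it should be much smaller at n = 1000 than at n = 10.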


† S. S. Wilks, Mathematical Statistics, Wiley, New York, 1962, pp. 257-258.


A-6 LAGRANGE MULTIPLIERS AND CONSTRAINED OPTIMIZATION


Frequently in economics one has to find the maximum or minimum of a function 
subject to some constraint, or set of constraints, on the independent variables. 
Thus one may have to maximize profits subject to the constraint of the produc- 
tion function, or maximize utility subject to the constraint of the budget equation. 
In the context of the regression model, as in Chap. 6, one may need to minimize a 
residual sum of squares subject to a set of constraints on the regression coefficients. 
Such constrained optimization problems may be tackled in two alternative 
fashions. The first is to substitute the constraints into the objective function, thus
reducing the number of independent variables, and to find the stationary values of
the resultant, unrestricted function. The second is to use the method of Lagrange
multipliers. We will illustrate with a simple example. 

Suppose the problem is to find the minimum value of $y = f(x, z) = x^2 + z^2$
subject to x + 2z = 10. Substituting the constraint in the objective function
means that the latter may be expressed as a function of just a single variable,
either x or z. For example, replacing x by 10 - 2z gives

$$y = (10 - 2z)^2 + z^2 = 100 - 40z + 5z^2$$

Differentiating

$$\frac{dy}{dz} = -40 + 10z \qquad \frac{d^2 y}{dz^2} = 10$$

Thus a minimum value occurs at z = 4 (x = 2), where the first derivative vanishes
and the second derivative is positive, and that minimum value is y = 20.

Alternatively, define a new, or augmented, objective function as

$$\phi = x^2 + z^2 - \lambda(x + 2z - 10)$$

where $\lambda$ is a Lagrange multiplier, whose value is as yet unknown. So long as the
constraint is satisfied, the term $\lambda(x + 2z - 10)$ vanishes, irrespective of the value
of $\lambda$, and $\phi$ will have the same stationary value as y. To find the stationary value
of $\phi$ we must take the three partial derivatives and equate to zero. Thus

$$\begin{aligned}
\frac{\partial \phi}{\partial x} &= 2x - \lambda = 0 \\
\frac{\partial \phi}{\partial z} &= 2z - 2\lambda = 0 \\
\frac{\partial \phi}{\partial \lambda} &= -(x + 2z - 10) = 0
\end{aligned}$$

The third equation ensures that the constraint is satisfied. Eliminating $\lambda$ from the
first two gives 2x = z, which on substitution in the third gives x = 2 (z = 4) and,
as before,

$$y_{\min} = \phi_{\min} = 20$$

The solution value for $\lambda$ is $\lambda = 2x = z = 4$, which in this case has no specific
significance. There are problems, however, where $\lambda$ may have a meaningful
economic interpretation.

The technique extends to handle more than one constraint, as is evidenced by 
the examples in Chaps. 2 and 6. 
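
The three first-order conditions of the worked example form a linear system in x, z, and the multiplier, so they can be solved mechanically. The Python sketch below is one way to do that, assuming a small hand-written Gauss-Jordan routine (the name `solve_linear` is hypothetical) and exact fractions so that no rounding enters.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a square linear system by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]


# First-order conditions in the unknowns (x, z, lambda):
#   2x      - lambda = 0
#        2z - 2 lambda = 0
#   x + 2z           = 10
coefficients = [[2, 0, -1], [0, 2, -2], [1, 2, 0]]
constants = [0, 0, 10]

x, z, lam = solve_linear(coefficients, constants)
y_min = x ** 2 + z ** 2

print(x, z, lam, y_min)
```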


A-7 RELATIONS BETWEEN THE NORMAL, χ², t, AND F DISTRIBUTIONS


Let z ~ N(0, 1) be a standard normal variable. If n random values $z_1, z_2, \ldots, z_n$
are drawn from this distribution, squared, and summed, the resultant statistic is
said to have a $\chi^2$ distribution with n degrees of freedom,

$$(z_1^2 + z_2^2 + \cdots + z_n^2) \sim \chi^2(n)$$

The precise mathematical form of the $\chi^2$ distribution need not concern us here.
The important point is that it constitutes a one-parameter family of distributions,
and the parameter is conventionally labeled the degrees of freedom of the
distribution. As the degrees of freedom tend to infinity, the $\chi^2$ distribution
approaches the normal density. Critical values of the $\chi^2$ distribution are given in
App. B-3.

The t distribution may be defined in terms of a normal and an independent
$\chi^2$ variable. Let

$$z \sim N(0, 1) \qquad \text{and} \qquad v \sim \chi^2(\nu)$$

where z and v are independently distributed. Then

$$t = \frac{z\sqrt{\nu}}{\sqrt{v}} \tag{A-13}$$

has Student's t distribution with $\nu$ degrees of freedom. The t distribution, like $\chi^2$,
is a one-parameter family. It is symmetrical about zero and tends asymptotically
to the standard normal distribution. Its critical values are given in App. B-2.

The F distribution is defined in terms of two independent $\chi^2$ variables. Let u
and v be independently distributed $\chi^2$ variables with $\nu_1$ and $\nu_2$ degrees of
freedom, respectively. Then the statistic

$$F = \frac{u/\nu_1}{v/\nu_2} \tag{A-14}$$

has the F distribution with $(\nu_1, \nu_2)$ degrees of freedom. Critical values are given in
App. B-4. In using the table note carefully that $\nu_1$ refers to the degrees of freedom
attaching to the expression in the numerator and $\nu_2$ to the expression in the
denominator.

If we square the expression for t, the result may be written

$$t^2 = \frac{z^2/1}{v/\nu}$$

where $z^2$, being the square of a standard normal variable, has the $\chi^2(1)$ distribution.
Thus $t^2 = F(1, \nu)$, that is, the square of a t variable with $\nu$ degrees of
freedom is an F variable with $(1, \nu)$ degrees of freedom.

The $\chi^2$ variable was formed from the sum of squares of standard normal
variables. Suppose, however, that the z variables are still independent but
distributed as

$$z_i \sim N(\mu_i, 1)$$

The statistic $z_1^2 + z_2^2 + \cdots + z_n^2$ now has the noncentral $\chi^2$ distribution with n
degrees of freedom. The previous distribution is sometimes referred to as the
central $\chi^2$ distribution. Corresponding to a noncentral $\chi^2$ distribution, there are
noncentral t and F distributions,† the former arising when the z variable in Eq.
(A-13) has a nonzero mean and the latter when u in Eq. (A-14) is noncentral but v is
central.


A-8 EXPECTATIONS IN BIVARIATE DISTRIBUTIONS 


Let X and Y be two variables with a bivariate pdf denoted by f(x, y). Let g(x, y)
be some function of the variables.‡ The problem is to evaluate E{g(x, y)}. By
definition

$$\begin{aligned}
E\{g(x, y)\} &= \iint g(x, y) f(x, y)\,dx\,dy \\
&= \iint g(x, y) f(x|y) f(y)\,dx\,dy
\end{aligned} \tag{A-15}$$

where f(x|y) denotes the conditional distribution of X given Y and f(y) denotes
the marginal distribution of Y. Rearranging

$$E\{g(x, y)\} = \int \left[\int g(x, y) f(x|y)\,dx\right] f(y)\,dy \tag{A-16}$$

The term inside the square brackets gives the expected value of g(x, y) in the
conditional distribution f(x|y), and we will denote the operation by $E_{x|y}$. This
conditional expectation is a function of Y, and it is then averaged over the
marginal distribution f(y). Thus we may write

$$E\{g(x, y)\} = E_y\{E_{x|y}\, g(x, y)\}$$

† For references to some tables for noncentral distributions see B. W. Lindgren, Statistical Theory,
2d edition, Macmillan, New York, 1968, p. 383.
‡ We exclude functions g(·) which may have some values undefined such as x/0 or 0/0.


Example Consider the simple bivariate distribution

            f(x, y)
         y = 2    y = 4    f(x)
x = 1    0.2      0.4      0.6
x = 2    0.3      0.1      0.4
f(y)     0.5      0.5      1.0

Let g(x, y) = x/y. The straightforward application of Eq. (A-15) would then
give

$$E\left(\frac{x}{y}\right) = \tfrac{1}{2}(0.2) + \tfrac{1}{4}(0.4) + (0.3) + \tfrac{1}{2}(0.1) = 0.55$$

Using Eq. (A-16) we would first find E(x/y) within each of the two columns
of the table, which, once divided by the column totals f(y), give the conditional
distributions f(x|y). Thus

$$E\left(\frac{x}{y} \,\Big|\, y = 2\right) = \frac{1}{2}\left(\frac{0.2}{0.5}\right) + \frac{2}{2}\left(\frac{0.3}{0.5}\right) = 0.8$$

and

$$E\left(\frac{x}{y} \,\Big|\, y = 4\right) = \frac{1}{4}\left(\frac{0.4}{0.5}\right) + \frac{2}{4}\left(\frac{0.1}{0.5}\right) = 0.3$$

Then averaging these expectations over f(y) gives, as before,

$$E\left(\frac{x}{y}\right) = 0.8(0.5) + 0.3(0.5) = 0.55$$

Clearly the process is symmetrical and we could average first of all over
each conditional distribution f(y|x) and then average the results over f(x).
The procedure, however, would break down for the function g(x, y) = x/y if
zero is a possible value for y.
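
The arithmetic of this example can be reproduced in a few lines. The Python sketch below stores the joint probabilities from the table as exact fractions and evaluates E(x/y) both directly, as in Eq. (A-15), and in two steps, as in Eq. (A-16). The variable names are illustrative assumptions.

```python
from fractions import Fraction as F

# Joint probabilities f(x, y) from the table, keyed by (x, y).
joint = {(1, 2): F(2, 10), (1, 4): F(4, 10), (2, 2): F(3, 10), (2, 4): F(1, 10)}


def g(x, y):
    return F(x, y)          # the function g(x, y) = x / y


# Direct evaluation, Eq. (A-15): sum of g(x, y) f(x, y) over all cells.
direct = sum(g(x, y) * p for (x, y), p in joint.items())

# Marginal distribution f(y): the column totals.
f_y = {}
for (x, y), p in joint.items():
    f_y[y] = f_y.get(y, 0) + p

# Conditional expectations E(x/y | y), using f(x|y) = f(x, y) / f(y).
conditional = {
    y0: sum(g(x, y) * p / f_y[y0] for (x, y), p in joint.items() if y == y0)
    for y0 in f_y
}

# Two-step evaluation, Eq. (A-16): average the conditional expectations over f(y).
two_step = sum(conditional[y0] * f_y[y0] for y0 in f_y)

print(direct, conditional, two_step)
```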


This result has useful applications in regression theory, but it has to be 
handled carefully. As a simple illustration consider the model

$$y_t = \beta x_t + u_t$$

where the u’s are well-behaved, the x’s are stochastic and distributed
independently of the u’s so that, in particular,

$$E(x_t u_t) = E(x_t)E(u_t) = 0 \qquad \text{for all } t$$

The OLS estimator of $\beta$ is

$$b = \frac{\sum x_t y_t}{\sum x_t^2} = \beta + \frac{\sum x_t u_t}{\sum x_t^2}$$

To examine the bias of b we need to evaluate $E(\sum x_t u_t / \sum x_t^2)$. Applying the
theorem

$$E\left(\frac{\sum x_t u_t}{\sum x_t^2}\right) = E_x\left\{E_{u|x}\left(\frac{\sum x_t u_t}{\sum x_t^2}\right)\right\} = 0$$

so that b is still unbiased when x is stochastic, provided it is independent of u.
The sampling variance is given by

$$\text{var}(b) = E\{(b - \beta)^2\} = E\left\{\left(\frac{\sum x_t u_t}{\sum x_t^2}\right)^2\right\}$$

Now

$$E_{u|x}\left\{\left(\frac{\sum x_t u_t}{\sum x_t^2}\right)^2\right\} = \frac{\sigma^2}{\sum x_t^2}$$

Thus

$$\text{var}(b) = \sigma^2 E\left\{\frac{1}{\sum x_t^2}\right\}$$

which is the one-dimensional version of the general result given in Eq. (7-26).


Now consider the case where x is stochastic but no longer independent of u. 
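Suppose, for example, that x, u follow a bivariate normal distribution

$$f(x, u) = \frac{1}{2\pi\sigma_x\sigma_u\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{u}{\sigma_u}\right) + \left(\frac{u}{\sigma_u}\right)^2\right]\right\}$$

with marginal densities

$$f(x) = \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_x}{\sigma_x}\right)^2\right]$$

and

$$f(u) = \frac{1}{\sigma_u\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{u}{\sigma_u}\right)^2\right]$$

The conditional density for u, given x, is

$$f(u|x) = \frac{1}{\sigma_u\sqrt{2\pi}\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2\sigma_u^2(1 - \rho^2)}\left[u - \frac{\rho\sigma_u(x - \mu_x)}{\sigma_x}\right]^2\right\} \tag{A-17}$$

Thus

$$E(xu|x) = xE(u|x) = \frac{\rho\sigma_u\, x(x - \mu_x)}{\sigma_x}$$

since Eq. (A-17) shows f(u|x) to be normal about mean $\rho\sigma_u(x - \mu_x)/\sigma_x$. It then
follows that

$$E\left(\frac{\sum x_t u_t}{\sum x_t^2}\right) = E_x\left\{E_{u|x}\left(\frac{\sum x_t u_t}{\sum x_t^2}\right)\right\} = \frac{\rho\sigma_u}{\sigma_x}\, E_x\left\{\frac{\sum x_t(x_t - \mu_x)}{\sum x_t^2}\right\}$$

On dividing the top and bottom by n, the term in brackets is approximately the
ratio of the sample variance of the x observations to their mean sum of squares. This
expectation does not vanish, and so b is a biased estimator, both for finite samples
and also asymptotically. This example is a legitimate application of Eq. (A-16)
since f(x, u) is a well-defined bivariate distribution.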

Now consider the model

$$y_t = \beta y_{t-1} + u_t \tag{A-18}$$

The OLS estimator is

$$b = \beta + \frac{\sum y_{t-1} u_t}{\sum y_{t-1}^2}$$

With the u’s well-behaved there is no problem about assuming

$$E(y_{t-1} u_t) = 0 \qquad \text{for all } t$$

An application of Eq. (A-16) would then appear to give

$$E\left(\frac{\sum y_{t-1} u_t}{\sum y_{t-1}^2}\right) = E_y\left\{E_{u|y}\left(\frac{\sum y_{t-1} u_t}{\sum y_{t-1}^2}\right)\right\} = 0$$

However, the OLS b is well known to be biased in finite samples.† The source of
the error is that $(y_{t-1}, u_t)$ does not have a well-defined bivariate pdf, which
renders the application of Eq. (A-16) invalid. Given some starting value $y_0$, once a
u vector is drawn, the y vector is exactly determined by Eq. (A-18). The stochastic
behavior of y is completely determined by u, so there is, in effect, only one
stochastic variable. Thus $E(\sum y_{t-1} u_t / \sum y_{t-1}^2)$ has to be evaluated solely over the u
distribution, and the two-step procedure of Eq. (A-16) does not apply.
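
The finite-sample bias in the model of Eq. (A-18) can be seen in a small Monte Carlo experiment. The Python sketch below is a rough illustration under stated assumptions: standard normal disturbances, a starting value of zero, a true coefficient of 0.5, samples of 10 observations, and 20,000 replications with a fixed seed. None of these settings comes from the text, and the function names are hypothetical.

```python
import random


def ols_ar1_estimate(beta, n, rng, y0=0.0):
    """Draw one sample from y_t = beta*y_{t-1} + u_t and return the OLS b."""
    y_prev, num, den = y0, 0.0, 0.0
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)      # well-behaved disturbance, N(0, 1)
        y = beta * y_prev + u
        num += y_prev * y            # running sum of y_{t-1} * y_t
        den += y_prev * y_prev       # running sum of y_{t-1}^2
        y_prev = y
    return num / den


def mean_ols_estimate(beta, n, reps, seed):
    """Average the OLS estimate over many independently drawn samples."""
    rng = random.Random(seed)
    return sum(ols_ar1_estimate(beta, n, rng) for _ in range(reps)) / reps


beta = 0.5
mean_b = mean_ols_estimate(beta, n=10, reps=20000, seed=12345)
print(mean_b)
```

Under these assumptions the average of b across replications should fall noticeably below the true coefficient, in contrast to the zero bias that the invalid two-step argument would suggest.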

A more complicated version of the same error can arise in IV estimation.
Consider

$$y_t = \beta y_{t-1} + \gamma x_t + u_t$$

Suppose $\{x_t\}$ is taken to be nonstochastic and $x_{t-1}$ is used as an instrument for
$y_{t-1}$. The W’Z matrix of Sec. 9-2 will then include terms in $\sum x_t y_{t-1}$ and
$\sum x_{t-1} y_{t-1}$. These are stochastic simply because u is stochastic and, as in the
simple example given above, a two-stage evaluation via $E_{u|y}$ followed by $E_y$ is
invalid.

† J. S. White, “Asymptotic Expansions for the Mean and Variance of the Serial Correlation
Coefficient,” Biometrika, vol. 48, 1961, pp. 85-94.


A-9 CHANGE OF VARIABLES IN DENSITY FUNCTIONS†


The basic idea may be simply illustrated for the univariate case. Suppose u is a
random variable with density function p(u), and suppose that a new variable y is
defined by the relation y = f(u). The y variable must then also have a density
function, and the problem is to express the density function for y in terms of the
density function for u and the relation y = f(u).
Suppose the relation between y and u is monotonically increasing, as shown in
Fig. A-3. Whenever u lies in the interval $\Delta u$, y will be in the corresponding
interval $\Delta y$. Thus

$$\Pr\{y \text{ lies in } \Delta y\} = \Pr\{u \text{ lies in } \Delta u\}$$

or

$$p(y')\,\Delta y = p(u')\,\Delta u$$

where u’ and y’ denote appropriate values of u and y in the intervals $\Delta u$ and $\Delta y$,
and p(y) indicates the postulated density function for y. Taking limits as $\Delta u$ goes
to zero gives

$$p(y) = p(u)\,\frac{du}{dy} \tag{A-19}$$

If y were a decreasing function of u, the derivative in Eq. (A-19) would be
negative, thus giving an impossible negative value for the density function. Thus
the absolute value of the derivative must be taken and the result reformulated to
read

$$p(y) = p(u)\left|\frac{du}{dy}\right| \tag{A-20}$$

[Figure A-3: y plotted against u for a monotonically increasing relation]

† A detailed treatment of this topic is given in L. D. Taylor, Probability and Mathematical
Statistics, Harper and Row, New York, 1974, Chap. 10.

If y = f(u) were not a monotonic function, Eq. (A-20) would require amendment,
but we are only concerned here with monotonic functions.
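
Equation (A-20) can be checked against a density obtained by differentiating a distribution function numerically. The Python sketch below assumes, purely for illustration, that u is normal with mean 1 and unit variance and that y = exp(-u), a decreasing relation, so that u = -ln y and du/dy = -1/y. The function names and the evaluation point y = 0.5 are likewise assumptions.

```python
import math


def normal_pdf(u, mean=0.0, sd=1.0):
    z = (u - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(u, mean=0.0, sd=1.0):
    return 0.5 * (1.0 + math.erf((u - mean) / (sd * math.sqrt(2.0))))


def transformed_density(p_u, inverse, d_inverse, y):
    """Eq. (A-20): p(y) = p(u) |du/dy|, evaluated at u = inverse(y)."""
    return p_u(inverse(y)) * abs(d_inverse(y))


def p_u(u):
    return normal_pdf(u, mean=1.0)      # assumed density of u: N(1, 1)


def inverse(y):
    return -math.log(y)                 # u as a function of y when y = exp(-u)


def d_inverse(y):
    return -1.0 / y                     # du/dy, negative because y decreases in u


def cdf_y(y):
    # Pr(exp(-u) <= y) = Pr(u >= -ln y)
    return 1.0 - normal_cdf(-math.log(y), mean=1.0)


y, h = 0.5, 1e-5
formula = transformed_density(p_u, inverse, d_inverse, y)
numeric = (cdf_y(y + h) - cdf_y(y - h)) / (2.0 * h)   # central difference of the cdf

print(formula, numeric)
```

Without the absolute value the formula would return a negative number here, which is the difficulty the text describes for decreasing functions.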

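In the multivariate case u and y now indicate vectors of, say, n variables each.
Under suitable conditions a result similar to Eq. (A-20) still holds, namely,

$$p(y) = p(u)\left|\frac{\partial u}{\partial y}\right| \tag{A-21}$$

where $|\partial u/\partial y|$ indicates the absolute value of the determinant formed from the
matrix of partial derivatives

$$\begin{bmatrix}
\dfrac{\partial u_1}{\partial y_1} & \dfrac{\partial u_1}{\partial y_2} & \cdots & \dfrac{\partial u_1}{\partial y_n} \\
\dfrac{\partial u_2}{\partial y_1} & \dfrac{\partial u_2}{\partial y_2} & \cdots & \dfrac{\partial u_2}{\partial y_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial u_n}{\partial y_1} & \dfrac{\partial u_n}{\partial y_2} & \cdots & \dfrac{\partial u_n}{\partial y_n}
\end{bmatrix}$$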


A-10 PRINCIPAL COMPONENTS 


Suppose we have a matrix X of n observations on k variables,

$$X = \begin{bmatrix}
x_{11} & x_{21} & \cdots & x_{k1} \\
x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots & & \vdots \\
x_{1n} & x_{2n} & \cdots & x_{kn}
\end{bmatrix}$$

where the observations have been expressed as deviations from the sample means,
for we are concerned with studying the variation in the data.

The nature of principal components may be approached in a number of ways.
One is to ask how many dimensions there are or how much independence there
really is in the set of k variables. More explicitly we consider the transformation
of the X’s to a new set of variables which will be pairwise uncorrelated and of
which the first will have the maximum possible variance, the second the maximum
possible variance among those uncorrelated with the first, and so forth. Let

$$z_{1t} = a_{11}x_{1t} + a_{21}x_{2t} + \cdots + a_{k1}x_{kt} \qquad t = 1, \ldots, n$$

denote the first new variable. In matrix form

$$z_1 = Xa_1 \tag{A-22}$$

where $z_1$ is an n-element vector and $a_1$ a k-element vector. The sum of squares of
$z_1$ is

$$z_1'z_1 = a_1'X'Xa_1 \tag{A-23}$$

We wish to choose $a_1$ to maximize $z_1'z_1$, but clearly some constraint must be
imposed on $a_1$, otherwise $z_1'z_1$ could be made infinitely large. So let us normalize
by setting

$$a_1'a_1 = 1 \tag{A-24}$$

The problem now is to maximize Eq. (A-23) subject to Eq. (A-24). Define

$$\phi = a_1'X'Xa_1 - \lambda_1(a_1'a_1 - 1)$$

where $\lambda_1$ is a Lagrange multiplier. Thus

$$\frac{\partial \phi}{\partial a_1} = 2X'Xa_1 - 2\lambda_1 a_1$$

Setting

$$\frac{\partial \phi}{\partial a_1} = 0$$

gives

$$(X'X)a_1 = \lambda_1 a_1 \tag{A-25}$$

Thus $a_1$ is an eigenvector of $X'X$ corresponding to the root $\lambda_1$. From Eqs. (A-23)
and (A-25) we see that

$$z_1'z_1 = \lambda_1 a_1'a_1 = \lambda_1$$

and so we must choose $\lambda_1$ as the largest eigenvalue of $X'X$. The $X'X$ matrix, in the
absence of perfect collinearity, will be positive definite and thus have positive
eigenvalues. The first principal component of X is then $z_1$.

Now define $z_2 = Xa_2$. We wish to choose $a_2$ to maximize $a_2'X'Xa_2$ subject to
$a_2'a_2 = 1$ and $a_1'a_2 = 0$. The reason for the second condition is that $z_2$ is to be
uncorrelated with $z_1$. The covariation between them is given by

$$\begin{aligned}
a_1'X'Xa_2 &= \lambda_1 a_1'a_2 \\
&= 0 \qquad \text{if and only if } a_1'a_2 = 0
\end{aligned}$$

Define

$$\phi = a_2'X'Xa_2 - \lambda_2(a_2'a_2 - 1) - \mu(a_1'a_2)$$

where $\lambda_2$ and $\mu$ are Lagrange multipliers.

$$\frac{\partial \phi}{\partial a_2} = 2X'Xa_2 - 2\lambda_2 a_2 - \mu a_1 = 0$$

Premultiply by $a_1'$

$$2a_1'X'Xa_2 - \mu = 0$$

But from

$$(X'X)a_1 = \lambda_1 a_1$$

$$a_2'(X'X)a_1 = \lambda_1 a_2'a_1 = 0$$

Thus

$$\mu = 0$$

and we have

$$(X'X)a_2 = \lambda_2 a_2 \tag{A-26}$$

and $\lambda_2$ should obviously be chosen as the second largest latent root of $X'X$.
We can proceed in this way for each of the k roots of X’X and assemble the 
resultant vectors in the orthogonal matrix 


A=[a, a, --- a] (A-27) 
The k principal components of X are then given by the n X k matrix Z, 
Z=XA (A-28) 
Moreover, 
A, 0 0 
ZVZ=AXXA=A=|'0 A, --- O (A-29) 
at ae i, 


showing that the principal components are indeed pairwise uncorrelated and that 
their variances are given by 


zz,=N;  i=1,...,k (A-30) 
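As an illustration of Eqs. (A-27) to (A-30), the sketch below assembles all k eigenvectors into an orthogonal matrix and checks that the cross-product matrix of the components is diagonal. It assumes NumPy and simulated mean-deviation data; the names `lam`, `A`, `Z`, and `ZtZ` are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4)) @ rng.normal(size=(4, 4))   # illustrative correlated data
X -= X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]         # sort roots from largest to smallest
lam = eigvals[order]
A = eigvecs[:, order]                     # A = [a_1 a_2 ... a_k], Eq. (A-27)
Z = X @ A                                 # all k principal components, Eq. (A-28)
ZtZ = Z.T @ Z                             # should equal diag(lambda_1, ..., lambda_k)

print(np.round(ZtZ, 6))
print(np.round(lam, 6))
```

The off-diagonal elements of the printed matrix should be zero up to rounding, and its diagonal should reproduce the ordered roots.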


If the rank of X were $r < k$, then $k - r$ eigenvalues would be zero and the variation in
the X's could be completely expressed in terms of r independent variables. Even
if X has full column rank, some of the $\lambda$'s may be fairly close to zero so that a
small number of principal components account for a substantial proportion of the
variance of the X's. The total variation in the X's is given by
$$\sum_t x_{1t}^2 + \sum_t x_{2t}^2 + \cdots + \sum_t x_{kt}^2 = \operatorname{tr}(X'X)$$
but
$$\operatorname{tr}(A'X'XA) = \operatorname{tr}(X'XAA') = \operatorname{tr}(X'X)$$
since $AA' = I$, and so from Eq. (A-29)
$$\sum_{i=1}^{k}\sum_{t=1}^{n} x_{it}^2 = \operatorname{tr}(X'X) = \sum_{i=1}^{k}\lambda_i = z_1'z_1 + \cdots + z_k'z_k$$
Thus
$$\frac{\lambda_1}{\sum\lambda},\; \frac{\lambda_2}{\sum\lambda},\; \ldots,\; \frac{\lambda_k}{\sum\lambda}$$
represent the proportionate contributions of each principal component to the
total variation of the X's, and since the components are orthogonal, these
contributions sum to unity.
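The trace argument can be illustrated with a short computation. This is a sketch under the same assumptions as before (NumPy, simulated mean-deviation data, one column deliberately made nearly redundant); it compares tr(X'X) with the sum of the roots and forms the proportionate and cumulative contributions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
X[:, 3] = X[:, 0] + 0.1 * X[:, 3]         # a nearly redundant column (illustrative)
X -= X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X)[::-1]   # roots of X'X, largest first
total = np.trace(X.T @ X)                 # total variation of the X's
shares = lam / lam.sum()                  # lambda_i / sum(lambda)
cumulative = np.cumsum(shares)

print(np.round(shares, 4))
print(np.round(cumulative, 4))
```

Because one column is close to a linear function of another, the smallest share should be near zero in this constructed example.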


It is sometimes difficult to attach a concrete meaning to specific principal
components. Occasionally a suggestion may be found in the correlations of a
component with various X's. To find the correlation between, say, the first
principal component and the X variables, we proceed as follows. The vector $X'z_1$
gives the cross products between $z_1$ and each X variable. But
$$X'z_1 = X'Xa_1 = \lambda_1 a_1$$
Thus the correlation between $X_i$ and $z_1$ is
$$r_{i1} = \frac{\lambda_1 a_{i1}}{\sqrt{\lambda_1}\sqrt{\sum_t x_{it}^2}} = \frac{a_{i1}\sqrt{\lambda_1}}{\sqrt{\sum_t x_{it}^2}} \tag{A-31}$$
where $a_{i1}$ is the ith element in the vector $a_1$. In general, the correlation between $X_i$
and $z_j$ is
$$r_{ij} = \frac{a_{ij}\sqrt{\lambda_j}}{\sqrt{\sum_t x_{it}^2}} \qquad i, j = 1, \ldots, k \tag{A-32}$$
These correlation coefficients may also be used to show how the variations in
each X variable may be decomposed into the contribution due to each component.
From
$$Z = XA$$
we have
$$Z' = A'X'$$
and
$$X' = AZ'$$
since A is orthogonal. So
$$X'X = AZ'ZA' = A\Lambda A'$$
from Eq. (A-29), and so
$$\sum_{t=1}^{n} x_{it}^2 = \sum_{j=1}^{k} \lambda_j a_{ij}^2 \qquad i = 1, \ldots, k \tag{A-33}$$
Dividing both sides of Eq. (A-33) by $\sum_t x_{it}^2$ gives
$$1 = \frac{\lambda_1 a_{i1}^2}{\sum_t x_{it}^2} + \frac{\lambda_2 a_{i2}^2}{\sum_t x_{it}^2} + \cdots + \frac{\lambda_k a_{ik}^2}{\sum_t x_{it}^2} \tag{A-34}$$
where the terms on the right-hand side are the squares of the correlation
coefficients defined in Eq. (A-32). Thus the proportions of the variation in $X_i$
associated with the various principal components are given by
$$r_{i1}^2,\; r_{i2}^2,\; \ldots,\; r_{ik}^2$$
and since the components are uncorrelated, these proportions sum to unity, as is
shown by Eq. (A-34).
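The correlations between the X variables and the components, and the row-by-row decomposition of Eq. (A-34), can be verified directly. The sketch below assumes NumPy and simulated mean-deviation data; the matrix `R` holds the correlation of the ith variable with the jth component computed from the eigenvectors and roots, and `R_direct` holds the same correlations computed from the data with `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(3)
mix = np.array([[1.0, 0.5, 0.2], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = rng.normal(size=(80, 3)) @ mix        # illustrative correlated data
X -= X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
lam, A = eigvals[order], eigvecs[:, order]
Z = X @ A

ss_x = (X ** 2).sum(axis=0)               # sum of squares of each X variable
# R[i, j] = a_ij * sqrt(lambda_j) / sqrt(sum_t x_it^2), as in Eq. (A-32)
R = A * np.sqrt(lam) / np.sqrt(ss_x)[:, None]

# The same correlations computed directly from the data
k = X.shape[1]
R_direct = np.corrcoef(np.hstack([X, Z]), rowvar=False)[:k, k:]

row_sums = (R ** 2).sum(axis=1)           # each row of squared correlations sums to one
print(np.round(R, 4))
print(row_sums)
```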

A note of warning should be inserted here. The development so far has
proceeded on the implicit assumption that the X variables are all measured in the
same units. If not, it is difficult to attach a meaning to concepts such as the total
variation of the X's and the partitioning of that total variation into the
contribution due to each component. It is still, of course, possible to compute the
eigenvalues and eigenvectors of $X'X$ even if the dimensions of the variables are
not all the same, and the correlations in Eq. (A-32) and the partitioning in Eq.
(A-34) would still be meaningful even though the partitioning of the total
variation in the X's would not. As an alternative, analyses are sometimes carried
out after all the X variables have been standardized, that is, each deviation from
the sample mean is divided by $\sqrt{n}$ times the sample standard deviation of that
variable. $X'X$ is now the matrix of zero-order correlation coefficients of the X
variables. The analysis can proceed from $X'X$ as before. Now $\operatorname{tr}(X'X) = k$, and
from the development following Eq. (A-29),
$$\sum_{i=1}^{k} \lambda_i = k$$
The eigenvalues and eigenvectors will in general be different from those yielded
by unstandardized variables. We leave it as an exercise for the reader to establish
whether the correlation coefficients in Eq. (A-32) are affected by the
standardization of the X variables.
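The effect of standardization on X'X can be seen in a small sketch. It assumes NumPy, simulated data in deliberately different units, and a sample standard deviation computed with divisor n (an assumption made here so that each standardized column has unit sum of squares).

```python
import numpy as np

rng = np.random.default_rng(4)
scales = np.array([1.0, 10.0, 100.0, 0.1])        # variables in different units
raw = rng.normal(size=(30, 4)) * scales
n, k = raw.shape

dev = raw - raw.mean(axis=0)                      # deviations from sample means
s = np.sqrt((dev ** 2).sum(axis=0) / n)           # standard deviation, divisor n (assumed)
Xs = dev / (np.sqrt(n) * s)                       # divide by sqrt(n) times the s.d.

C = Xs.T @ Xs                                     # now the correlation matrix
lam = np.linalg.eigvalsh(C)

print(np.round(C, 4))
print(np.trace(C), lam.sum())
```

Both printed scalars should equal k, here 4, up to rounding.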

Empirically, then, one may compute the principal components for a given X
matrix and see how much of the variation of the X's is accounted for by various
components. Frequently the intercorrelation of economic and social data means
that a small number of components will account for a large proportion of the
total variation, and it is desirable to have a test for judging the number of
components to retain for further analysis. Suppose that we have computed the
roots $\lambda_1, \lambda_2, \ldots, \lambda_k$ and that the first r roots $\lambda_1, \lambda_2, \ldots, \lambda_r$ ($r < k$) seem both
sufficiently large and sufficiently different to be retained. The question then is
whether the remaining $k - r$ roots and their associated vectors and the
components are sufficiently alike for one to conclude that the true values are equal. A
very approximate test is based on
$$\rho = (\lambda_{r+1}\lambda_{r+2}\cdots\lambda_k)^{-1}\left(\frac{\lambda_{r+1} + \lambda_{r+2} + \cdots + \lambda_k}{k - r}\right)^{k-r} \tag{A-35}$$
The proposed test is to consider $n\log_e\rho$ to follow a $\chi^2$ distribution with
$\frac{1}{2}(k - r - 1)(k - r + 2)$ degrees of freedom, if the null hypothesis of equality of
the remaining latent roots is true.† One hopes in practical applications that the
number r of significantly different components to be retained is substantially less
than the number of variables k from which the components have been computed.
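A rough numerical sketch of the test follows. It assumes that the statistic is read as the (k - r)th power of the ratio of the arithmetic mean to the geometric mean of the remaining roots, which equals one when those roots are identical; this reading, the function name `equal_roots_statistic`, and the example roots are all assumptions made for illustration, and the source cited in the footnote should be consulted for the exact form and its qualifications.

```python
import math

def equal_roots_statistic(roots, r, n):
    """Return (n * log(rho), degrees of freedom) for the last k - r roots.

    rho is taken here as (arithmetic mean)^(k-r) / (product of the roots),
    an assumed reading of the test statistic.
    """
    tail = list(roots)[r:]
    q = len(tail)                                   # q = k - r
    log_rho = q * math.log(sum(tail) / q) - sum(math.log(v) for v in tail)
    dof = (q - 1) * (q + 2) / 2                     # (1/2)(k - r - 1)(k - r + 2)
    return n * log_rho, dof

# Hypothetical roots: the last three are exactly equal in the first case
stat_equal, dof_equal = equal_roots_statistic([5.0, 1.0, 1.0, 1.0], r=1, n=50)
stat_unequal, dof_unequal = equal_roots_statistic([5.0, 2.0, 1.0, 0.5], r=1, n=50)

print(stat_equal, dof_equal)
print(stat_unequal, dof_unequal)
```

The statistic would then be compared with a tabulated chi-square point for the stated degrees of freedom.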


† See M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 3, Griffin, London,
1966, pp. 292-293, for details and qualifications.

A somewhat similar result is achieved by factor analysis, in which the X
variables are specified ab initio to be linear combinations of a small number of
independent standard normal variables (factors) plus an independent normal
error term. From the principal component analysis we have
$$Z = XA$$
and hence
$$X = ZA' \tag{A-36}$$
Equations (A-36) express the X's as exact linear combinations of the components
with coefficients given by the elements of A. If, however, we retain fewer than k
principal components, Eqs. (A-36) would have to be replaced by
$$X = Z^*A^{*\prime} + U \tag{A-37}$$
where $Z^*$ and $A^*$ denote the submatrices of Z and A giving the retained
components and the corresponding eigenvectors, and U is a matrix of errors.
Principal components is obviously a possible estimation method in factor
analysis, but slight modifications are required to the $A^*$ coefficients to conform to the
imposed assumption that the factors should have unit variance. Without
additional restrictions $z_i'z_i = \lambda_i$, as we have seen in Eq. (A-29). When the $A^*$
coefficients have been adjusted, they are referred to as factor loadings. However,
several other estimation methods are used in factor analysis, and we do not
propose to discuss them here.† An interesting application of factor analysis is
given by Adelman and Morris,‡ who find that 66 percent of the variance of the
GNP per capita in 74 underdeveloped countries is associated with just four factors,
which have in turn been based on a complex of more than 20 social and political
variables.
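The relation between the exact representation in Eq. (A-36) and the truncated one in Eq. (A-37) can be sketched numerically. The example assumes NumPy and simulated data generated from two underlying factors plus small noise; the rescaling at the end treats "unit variance" as unit sum of squares, which is an assumption of this sketch, and the names `F` and `L` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
factors = rng.normal(size=(100, 2))               # two underlying factors (illustrative)
weights = rng.normal(size=(2, 6))
X = factors @ weights + 0.05 * rng.normal(size=(100, 6))
X -= X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
lam, A = eigvals[order], eigvecs[:, order]

r = 2                                             # number of retained components
A_star = A[:, :r]
Z_star = X @ A_star
U = X - Z_star @ A_star.T                         # error matrix in Eq. (A-37)

# The residual sum of squares equals the sum of the discarded roots
resid_ss = (U ** 2).sum()

# Rescale so each retained component has unit sum of squares (assumed normalization)
F = Z_star / np.sqrt(lam[:r])
L = A_star * np.sqrt(lam[:r])                     # adjusted coefficients

print(resid_ss, lam[r:].sum())
```

The rescaling leaves the fitted part unchanged, since `F @ L.T` reproduces `Z_star @ A_star.T`.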

Table A-1 shows another example in which a small number of components
effectively account for the variation in a set of data. The basic data are 11 series
of average quarterly interest rates in the United Kingdom from the first quarter of
1963 to the first quarter of 1969. They include various national and local
government rates as well as commercial rates, such as those on Building Society
deposits. The series were standardized and the second row of the table gives the
values of $\lambda_i/\sum\lambda$ for the first four principal components. The first principal
component, which turned out to be effectively a simple arithmetic average of the
standardized series, accounts for over 83 percent of the total variance and the first
three components account for almost 97 percent. The last seven components
account for less than 2 percent of the total variation.


† See J. T. Scott, Jr., "Factor Analysis and Regression," Econometrica, vol. 34, 1966, pp. 552-562;
M. G. Kendall and A. Stuart, op. cit., pp. 306-311; and H. H. Harman, Modern Factor Analysis,
University of Chicago Press, Chicago, 1960.

‡ I. Adelman and C. T. Morris, "Factor Analysis of the Interrelationship between Social and
Political Variables and Per Capita Gross National Product," Quarterly Journal of Economics, vol. 79,
1965, pp. 555-578.

Table A-1 Contribution of principal components to the total variation of
11 interest rates

Component                      1         2         3         4
Contribution                 0.8368    0.0831    0.0482    0.0156
Cumulative contribution      0.8368    0.9199    0.9681    0.9837

Source: L. D. D. Price and P. Burman, Bank of England.


An important but as yet unresolved question concerns the use of principal
components in conventional econometric regression problems. There appear to be
at least two possibilities that are worth distinguishing. In both we are still
assuming that some Y variable is to be explained in terms of a set of X variables.
In the first problem, however, the number of variables that might possibly be
included in the X matrix on theoretical or other grounds is so large and possibly
so intercorrelated that conventional estimation procedures would be dubious for
lack of degrees of freedom aggravated by multicollinearity. An obvious approach
is then to apply principal component analysis to the X variables to see whether a
small number of components might account for a sufficiently large proportion of
the total variation of the X's and then to use these components as explanatory
variables in a conventional regression with Y as the dependent variable. Some
discussion of this topic is given in the treatment of 2SLS in Chap. 11. A possible
variant on this approach is to retain a small number of specific important X
variables in the final regression along with principal components determined from
the other X variables.† This seems valid and useful, as far as it goes, and if some
economic or social significance can be attached to specific components, so much
the better. The second suggested use is more doubtful and requires more
examination than it has yet received.‡ It concerns the case where multicollinearity rather
than an excessive number of the X variables is the problem. As is well known,
least-squares estimation of the coefficients of the X variables becomes very
imprecise. Kendall's suggestion is to compute the principal components of the X
variables, discard those with low eigenvalues, regress Y on the retained principal
components, and transform back from the regression coefficients on the principal
components to obtain estimates of the coefficients of the X variables. Suppose, for
example, that there are five X variables and we retain just two principal
components,
$$z_1 = a_{11}x_1 + a_{21}x_2 + \cdots + a_{51}x_5$$
$$z_2 = a_{12}x_1 + a_{22}x_2 + \cdots + a_{52}x_5$$

† For an illustration see G. B. Pidot, Jr., "A Principal Components Analysis of the Determinants
of Local Government Fiscal Patterns," Review of Economics and Statistics, vol. 51, 1969, pp. 176-188.
‡ See M. G. Kendall, A Course in Multivariate Analysis, Griffin, London, 1957, pp. 70-74.

The regression of Y on $z_1$, $z_2$ is then
$$Y = b_1z_1 + b_2z_2 + e$$
$$= b_1(a_{11}x_1 + \cdots + a_{51}x_5) + b_2(a_{12}x_1 + \cdots + a_{52}x_5) + e$$
$$= (b_1a_{11} + b_2a_{12})x_1 + \cdots + (b_1a_{51} + b_2a_{52})x_5 + e \tag{A-38}$$
If one retained all five principal components, the coefficients of the x's in Eq.
(A-38) would be identical with those given by a direct regression of Y on the x's.
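The transformation back from component coefficients to coefficients on the x's, and the equivalence with a direct regression when all components are kept, can be sketched as follows. The example assumes NumPy, simulated mean-deviation data with one strongly collinear column, and a hypothetical helper `pc_regression`; none of the numbers come from the text.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 60, 5
X = rng.normal(size=(n, k))
X[:, 4] = X[:, 0] + X[:, 1] + 0.05 * rng.normal(size=n)   # strongly collinear column
X -= X.mean(axis=0)
y = X @ np.array([1.0, -0.5, 0.25, 0.0, 2.0]) + rng.normal(size=n)
y -= y.mean()

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
lam, A = eigvals[order], eigvecs[:, order]
Z = X @ A

def pc_regression(m):
    """Regress y on the first m components and map back to coefficients on the x's."""
    b = (Z[:, :m].T @ y) / lam[:m]        # components are orthogonal: b_j = z_j'y / lambda_j
    return A[:, :m] @ b                   # coefficient on x_i is sum_j b_j * a_ij

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # direct regression of y on the x's
beta_all = pc_regression(k)                       # all five components retained
beta_two = pc_regression(2)                       # only two components retained

print(np.round(beta_ols, 4))
print(np.round(beta_all, 4))
print(np.round(beta_two, 4))
```

The first two printed vectors should coincide up to rounding, while the third generally differs.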
How should we decide on the number of components to retain? A purely subjective
decision on the size of the latent roots, as in Kendall's illustrative example, is
hardly satisfactory. Should one use the test based on Eq. (A-35) or a conventional
analysis of variance test on the regression? The procedure would give a nonsense
result in the case of perfectly collinear x variables. For example, suppose
$x_2 = 2x_1$ and let
$$X'X = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$$
The eigenvalues are $\lambda_1 = 5$ and $\lambda_2 = 0$. For $\lambda_1 = 5$,
$$a_1 = \begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}$$
and for $\lambda_2 = 0$,
$$a_2 = \begin{bmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{bmatrix}$$
The second principal component does not exist, for
$$z_2 = \frac{2}{\sqrt{5}}x_1 - \frac{1}{\sqrt{5}}x_2 = 0$$
since $x_2 = 2x_1$. However, the first component does exist, for
$$z_1 = \frac{1}{\sqrt{5}}x_1 + \frac{2}{\sqrt{5}}x_2$$
and so the coefficient of $z_1$ in the regression with Y as the dependent variable can
be computed as
$$b_1 = \frac{\sum z_1 y}{\sum z_1^2} = \frac{1}{5}\sum z_1 y$$
Substituting in Eq. (A-38) gives
$$Y = b_1z_1 + e = \frac{b_1}{\sqrt{5}}x_1 + \frac{2b_1}{\sqrt{5}}x_2 + e$$
and apparently the relative influence of $x_1$ and $x_2$ has been determined in a
perfectly collinear case where such a determination is impossible. The coefficients
on $x_1$ and $x_2$ from the principal component regression are seen to reflect simply
the fact that $x_2 = 2x_1$ and are unrelated to the true but unknown parameters.
Nevertheless the question remains whether or not the approach might work
reasonably well in a less than perfectly collinear case.†
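The arithmetic of the perfectly collinear example can be reproduced in a few lines. This is a minimal sketch assuming NumPy; it uses only the 2 x 2 matrix of sums of squares and products with roots 5 and 0, and the sign of the eigenvector is fixed by convention since it is otherwise arbitrary.

```python
import numpy as np

# Sums of squares and cross products when the second variable is twice the first
XtX = np.array([[1.0, 2.0], [2.0, 4.0]])

eigvals, eigvecs = np.linalg.eigh(XtX)    # ascending order: roots 0 and 5
lam1 = eigvals[-1]
a1 = eigvecs[:, -1]
a1 = a1 * np.sign(a1[0])                  # fix the arbitrary sign of the eigenvector
a2 = eigvecs[:, 0]

z2_ss = a2 @ XtX @ a2                     # sum of squares of the second component
ratio = a1[1] / a1[0]                     # ratio of the implied coefficients on the x's

print(lam1, a1, z2_ss, ratio)
```

The ratio of the two implied coefficients is 2 whatever the data on Y, which is the point made in the text: it reflects only the collinearity between the regressors.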


† For a further contribution which shows that the principal component approach can be an
improvement over OLS in certain circumstances see B. T. McCallum, "Artificial Orthogonalization in
Regression Analysis," Review of Economics and Statistics, vol. 52, 1970, pp. 110-113. The basic point
is that the principal component estimators will be biased but will have smaller variances than the
unbiased OLS estimators. Thus under certain conditions, the principal components estimators may
have smaller mean-square errors than OLS estimators.

Table B-1 Areas of a standard normal distribution

An entry in the table is the proportion under the entire curve which is between
z = 0 and a positive value of z. Areas for negative values of z are obtained by
symmetry.

 z      .01     .02     .03     .04     .07
0.0    .0040                   .0160   .0279
0.1    .0438                   .0557   .0675
0.2    .0832                   .0948   .1064
0.3    .1217                   .1331   .1443
0.4    .1591                   .1700   .1808
0.5    .1950   .1985   .2019   .2054   .2157
0.6    .2291   .2324   .2357   .2389   .2486
0.7    .2611   .2642   .2673   .2703   .2794
0.8    .2910   .2939   .2967   .2995   .3078
0.9    .3186   .3212   .3238   .3264   .3340
1.0    .3438   .3461   .3485   .3508   .3577
1.1    .3665   .3686   .3708   .3729   .3790
1.2    .3869   .3888   .3907   .3925   .3980
1.3    .4049   .4066   .4082   .4099   .4147
1.4    .4207   .4222   .4236   .4251   .4292
1.5    .4345   .4357   .4370   .4382   .4418
1.6    .4463   .4474   .4484   .4495   .4525
1.7    .4564   .4573   .4582   .4591   .4616
1.8    .4649   .4656   .4664   .4671   .4693
1.9    .4719   .4726   .4732   .4738   .4756
2.0    .4778   .4783   .4788   .4793   .4808
2.1    .4826   .4830   .4834   .4838   .4850
2.2    .4864   .4868   .4871   .4875   .4884
2.3    .4896   .4898   .4901   .4904   .4911
2.4    .4920   .4922   .4925   .4927   .4932
2.5    .4940   .4941   .4943   .4945   .4949
2.6    .4955   .4956   .4957   .4959   .4962
2.7    .4966   .4967   .4968   .4969   .4972
2.8    .4975   .4976   .4977   .4977   .4979
2.9    .4982   .4982   .4983   .4984   .4985
3.0    .4987   .4987   .4988   .4988   .4989


Reprinted from P. G. Hoel, Introduction to Mathematical Statistics, 4th ed.,
New York, Wiley, 1971, by permission of the publishers.

Table B-2 Student's t distribution

The first column lists the number of degrees of freedom ($\nu$). The headings of
the other columns give probabilities (P) for t to exceed the entry value. Use
symmetry for negative t values.

Reprinted from P. G. Hoel, Introduction to Mathematical Statistics,
4th ed., New York, Wiley, 1971, by permission of the publishers.

Table B-3 $\chi^2$ distribution

For degrees of freedom greater than 30, the expression $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may be used as a normal deviate with unit variance, where n is the number of degrees of freedom.
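Both the normal areas of Table B-1 and the large-sample rule quoted under Table B-3 are easy to compute directly. The sketch below uses only the Python standard library; the function names are hypothetical, and the value 55.76 is the familiar upper 5 percent point of chi-square with 40 degrees of freedom, used here purely as an illustration.

```python
import math

def normal_area(z):
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

def chi2_normal_deviate(chi2_value, dof):
    """sqrt(2*chi2) - sqrt(2*dof - 1), treated as a unit normal deviate for dof > 30."""
    return math.sqrt(2.0 * chi2_value) - math.sqrt(2.0 * dof - 1.0)

area_057 = normal_area(0.57)              # compare with the z = 0.5, .07 entry of Table B-1
area_196 = normal_area(1.96)
deviate = chi2_normal_deviate(55.76, 40)  # upper 5 percent chi-square point, 40 d.f.

print(round(area_057, 4), round(area_196, 4), round(deviate, 3))
```

The deviate obtained for the 5 percent chi-square point is close to, though not exactly equal to, the corresponding one-tailed normal point, as expected from an approximation.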


Reprinted from R. A. Fisher, Statistical Methods for Research Workers, 14th ed., New York, Macmillan Publishing Co., Inc.

Table B-6 Wallis statistic for fourth-order autocorrelation

5 percent significance points of $d_{4,L}$ and $d_{4,U}$ for regressions
without quarterly dummy variables (k = k' + 1)

          k' = 1           k' = 2           k' = 3           k' = 4           k' = 5
  n    d4,L   d4,U     d4,L   d4,U     d4,L   d4,U     d4,L   d4,U     d4,L   d4,U
 16   0.774  0.982    0.662  1.109    0.549  1.275    0.435  1.381    0.350  1.532
 20   0.924  1.102    0.827  1.203    0.728  1.327    0.626  1.428    0.544  1.556
 24   1.036  1.189    0.953  1.273    0.867  1.371    0.779  1.459    0.702  1.565
 28   1.123  1.257    1.050  1.328    0.975  1.410    0.898  1.487    0.828  1.576
 32   1.192  1.311    1.127  1.373    1.061  1.443    0.993  1.511    0.929  1.587
 36   1.248  1.355    1.191  1.410    1.131  1.471    1.070  1.532    1.013  1.598
 40   1.295  1.392    1.243  1.442    1.190  1.496    1.135  1.550    1.082  1.609
 44   1.335  1.423    1.288  1.469    1.239  1.518    1.189  1.567    1.141  1.620
 48   1.369  1.451    1.326  1.493    1.281  1.537    1.236  1.582    1.191  1.630
 52   1.399  1.475    1.359  1.513    1.318  1.554    1.276  1.595    1.235  1.639
 56   1.426  1.496    1.389  1.532    1.351  1.569    1.312  1.608    1.273  1.648
 60   1.449  1.515    1.415  1.548    1.379  1.583    1.343  1.619    1.307  1.656
 64   1.470  1.532    1.438  1.563    1.405  1.596    1.371  1.629    1.337  1.664
 68   1.489  1.548    1.459  1.577    1.427  1.608    1.396  1.639    1.364  1.671
 72   1.507  1.562    1.478  1.589    1.448  1.618    1.418  1.648    1.388  1.678
 76   1.522  1.574    1.495  1.601    1.467  1.628    1.439  1.656    1.411  1.685
 80   1.537  1.586    1.511  1.611    1.484  1.637    1.457  1.663    1.431  1.691
 84   1.550  1.597    1.525  1.621    1.500  1.646    1.475  1.671    1.409  1.696
 88   1.562  1.607    1.539  1.630    1.515  1.654    1.490  1.677    1.466  1.702
 92   1.574  1.617    1.551  1.639    1.528  1.661    1.505  1.684    1.482  1.707
 96   1.584  1.626    1.563  1.647    1.541  1.668    1.519  1.690    1.496  1.712
100   1.594  1.634    1.573  1.654    1.552  1.674    1.531  1.695    1.510  1.717


5 percent significance points of $d_{4,L}$ and $d_{4,U}$ for regressions including
a constant term and quarterly dummy variables (k = k'' + 4)

          k'' = 1          k'' = 3          k'' = 4          k'' = 5
  n    d4,L   d4,U     d4,L   d4,U     d4,L   d4,U     d4,L   d4,U
 16   1.156  1.381    0.902  1.776    0.777  2.191    0.693  2.238
 20   1.228  1.428    1.013  1.726    0.899  1.954    0.806  2.042
 24   1.287  1.459    1.107  1.694    1.011  1.856    0.928  1.949
 28   1.337  1.487    1.181  1.679    1.099  1.803    1.025  1.889
 32   1.379  1.511    1.243  1.673    1.171  1.773    1.104  1.850
 36   1.414  1.532    1.293  1.672    1.230  1.755    1.170  1.624
 40   1.445  1.550    1.336  1.674    1.279  1.745    1.225  1.807
 44   1.471  1.567    1.373  1.677    1.321  1.739    1.272  1.795
 48   1.494  1.582    1.404  1.681    1.357  1.737    1.312  1.788
 52   1.514  1.595    1.432  1.686    1.389  1.736    1.347  1.782
 56   1.533  1.608    1.456  1.691    1.416  1.736    1.377  1.779
 60   1.549  1.619    1.478  1.696    1.441  1.737    1.404  1.777
 64   1.564  1.629    1.497  1.700    1.463  1.739    1.429  1.776
 68   1.577  1.639    1.515  1.705    1.482  1.741    1.450  1.775
 72   1.590  1.648    1.531  1.710    1.500  1.743    1.470  1.776
 76   1.601  1.656    1.545  1.714    1.517  1.766    1.488  1.776
 80   1.611  1.663    1.559  1.719    1.531  1.748    1.504  1.777
 84   1.621  1.671    1.571  1.723    1.545  1.751    1.519  1.778
 88   1.630  1.677    1.582  1.727    1.558  1.753    1.533  1.779
 92   1.639  1.684    1.593  1.731    1.570  1.756    1.546  1.781
 96   1.647  1.690    1.603  1.735    1.580  1.759    1.558  1.782
100   1.654  1.695    1.612  1.739    1.591  1.761    1.569  1.784


Reprinted by permission from Econometrica, vol. 40, no. 4, 1972, pp. 623-625.

Table B-7 5 percent, 1 percent, and .1 percent points of the modified Von Neumann ratio

              One-tailed test against       One-tailed test against
Degrees of    positive autocorrelation      negative autocorrelation
freedom         5%      1%     .1%            5%      1%     .1%
    2          .025    .001    .000          3.975   3.999   4.000
    3          .252    .052    .005          4.142   4.427   4.493
    4          .474    .170    .037          3.827   4.295   4.496
    5          .598    .292    .095          3.571   4.076   4.378
    6          .701    .386    .163          3.613   3.881   4.233
    7          .790    .464    .228          3.299   3.731   4.095
    8          .861    .537    .285          3.206   3.618   3.973
    9          .922    .601    .339          3.131   3.526   3.871
   10          .975    .657    .390          3.069   3.445   3.786
   11         1.020    .708    .438          3.016   3.378   3.710
   12         1.060    .753    .482          2.970   3.319   3.645
   13         1.096    .795    .523          2.930   3.268   3.587
   14         1.128    .832    .561          2.895   3.222   3.535
   15         1.157    .866    .597          2.863   3.181   3.488
   16         1.183    .898    .630          2.835   3.144   3.445
   17         1.207    .927    .661          2.809   3.110   3.406
   18         1.228    .954    .691          2.785   3.079   3.370
   19         1.249    .979    .718          2.764   3.051   3.337
   20         1.267   1.003    .744          2.744   3.025   3.306
   21         1.285   1.024    .769          2.725   3.000   3.277
   22         1.301   1.045    .792          2.708   2.978   3.250
   23         1.316   1.064    .814          2.692   2.957   3.225
   24         1.330   1.082    .834          2.677   2.937   3.201
   25         1.344   1.100    .854          2.663   2.918   3.179
   26         1.356   1.116    .873          2.650   2.901   3.157
   27         1.368   1.131    .891          2.638   2.884   3.137
   28         1.380   1.146    .908          2.626   2.868   3.118
   29         1.390   1.160    .925          2.615   2.854   3.100
   30         1.400   1.173    .940          2.605   2.839   3.083
   31                                        2.595   2.826   3.066
   32                                        2.585   2.813   3.051
   33                                        2.576   2.801   3.036
   34                                        2.567   2.789   3.021
   35                                        2.559   2.778   3.007
   36         1.452   1.241   1.022          2.551   2.767   2.994
   37         1.460   1.251   1.034          2.544   2.757   2.962
   38         1.467   1.261   1.045          2.536   2.747   2.969
   39         1.474   1.270   1.057          2.529   2.738   2.957
   40         1.480   1.279   1.067          2.522   2.729   2.946
   41         1.487   1.287   1.078          2.516   2.720   2.935
   42         1.493   1.295   1.088          2.510   2.711   2.925
   43         1.499   1.303   1.097          2.504   2.703   2.914
   44         1.504   1.311   1.107          2.498   2.695   2.904
   45         1.510   1.318   1.116          2.492   2.687   2.895
   46         1.515   1.325   1.125          2.487   2.680   2.885
   47         1.520   1.332   1.133          2.482   2.673   2.876
   48         1.525   1.339   1.142          2.477   2.666   2.868
   49         1.530   1.346   1.150          2.472   2.659   2.859
   50         1.535   1.352   1.158          2.467   2.653   2.851
   51         1.540   1.358   1.165          2.462   2.646   2.843
   52         1.544   1.364   1.173          2.458   2.640   2.835
   53         1.548   1.370   1.180          2.453   2.634   2.828
   54         1.552   1.376   1.187          2.449   2.628   2.820
   55         1.557   1.381   1.194          2.445   2.623   2.813
   56         1.561   1.387   1.201          2.441   2.617   2.806
   57         1.564   1.392   1.207          2.437   2.612   2.799
   58         1.568   1.397   1.214          2.433   2.606   2.793
   59         1.572   1.402   1.220          2.429   2.601   2.786
   60         1.575   1.407   1.226          2.426   2.596   2.780

Reprinted by permission of S. J. Press and R. B. Brooks from Report No. 6911, Center for
Mathematical Studies in Business and Economics, University of Chicago, Chicago, 1969.


Table B-8 Significance values for $c_0$ in the cusum of squares test


po Bt whet ietmes ent ios pa slash 


0-005 m 0-10 0-05 0-025 0-01 0-005 


0.49500 41 0.14916 0.17215 0.19254 0.21667 .233310 


59596 42 +17034 — .19050 23081 
5790043 16858-18852 22839 
5021044 +1688 18661 +2605 
157645 31652418475 22377 
48988 46 «16364 .18295 22157 
46761 AT +16208 — .18120 -21943 
40819 4B +16058  .17950 221735 
ASO7L 49 15911417785 121534 
AI517 50 115769 417624 121337 
40122, SI +15630 517468 21146 
38856 52 1549517316 -20961 
3770353 -20780 
336649 5h 20604 
3567955 -20432 
34784656 4.20265 
3395357 -20101 
33181 38 +1982 
3205959 19786 
3178460 118245 «19635 
349 62 1797319341 
130552, 64 A713 419061 
2998966 17464 —.18792 
29056 68 «13728 415329417226, 18535 
28951 70 13548415127 .16997.18288 
2007272 13375414932 ,16777—.18051 
20016 74 +13208 414745 16566 «17823 
2758216 113048414565 16363 .17604 
276878 12894414392 416167 «17392 
2677280 12745 414224415978 417188 
26395 82 12601-14063 .15795 £16992 
26030 BA 112462 1390715619 16802 
-25683 86 112327413756 15449 £16618 
125308 BB 11219713610 «1528416440 
2502790) 12071413468 15124 «16268 
2071892 111949 

m2; 94 A183) 
2013496 S1I716 
12385798 s11608 
123589 100 96 

Values for odd n greater than 650 are available from the author on request. 


The values of $c_0$ are used to determine the pair of lines $s_r = \pm c_0 + (r - k)/(n - k)$. For n
observations, k explanatory variables (including the intercept, if there is one), and a given significance
level $\alpha$, $c_0$ is found by entering the table at $m = \frac{1}{2}(n - k) - 1$ and $\frac{1}{2}\alpha$. For a one-sided test, enter at
$m = \frac{1}{2}(n - k) - 1$ and $\alpha$. When $(n - k)$ is odd, the procedure suggested is to interpolate linearly
between $m = \frac{1}{2}(n - k) - \frac{1}{2}$ and $m = \frac{1}{2}(n - k) - \frac{3}{2}$.
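The entry rule for the table and the construction of the two boundary lines can be written out as a small sketch. It uses only the Python standard library; the function names are hypothetical, the value passed as `c0` is a placeholder rather than a tabulated entry, and the rule coded is the one described in the note: enter at half of (n - k) less one, use half the significance level for a two-sided test, and bracket the entry point with the two neighbouring integer values of m when n - k is odd.

```python
def cusum_sq_table_entry(n, k, alpha, two_sided=True):
    """Return (list of m values, significance column) at which the table is entered."""
    half = (n - k) / 2.0
    column = alpha / 2.0 if two_sided else alpha
    if (n - k) % 2 == 0:
        return [half - 1.0], column
    # n - k odd: interpolate linearly between these two neighbouring values of m
    return [half - 1.5, half - 0.5], column

def cusum_sq_bounds(r, n, k, c0):
    """Lower and upper boundary lines, -c0 and +c0 around (r - k)/(n - k)."""
    centre = (r - k) / (n - k)
    return centre - c0, centre + c0

entry_even = cusum_sq_table_entry(50, 4, 0.05)    # n - k = 46
entry_odd = cusum_sq_table_entry(51, 4, 0.05)     # n - k = 47
lower, upper = cusum_sq_bounds(27, 50, 4, c0=0.2) # c0 = 0.2 is a placeholder value

print(entry_even, entry_odd, lower, upper)
```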


Reprinted by permission of the Biometrika Trustees from Biometrika, vol. 56, 1969, p. 4.